FRONT MATTER


Abstract 


This thesis addresses the issue of translating mathematical expressions from LaTeX to the syntax
of Computer Algebra Systems (CAS). Over the past decades, especially in the domain of Science,
Technology, Engineering, and Mathematics (STEM), LaTeX has become the de-facto standard
to typeset mathematical formulae in publications. Since scientists are generally required to
publish their work, LaTeX has become an integral part of today’s publishing workflow. On the
other hand, modern research increasingly relies on CAS to simplify, manipulate, compute, and
visualize mathematics. However, existing LaTeX import functions in CAS are limited to simple
arithmetic expressions and are, therefore, insufficient for most use cases. Consequently, the
workflow of experimenting and publishing in the Sciences often includes time-consuming and
error-prone manual conversions between presentational LaTeX and computational CAS formats.


To address the lack of a reliable and comprehensive translation tool between LaTeX and CAS,
this thesis makes the following three contributions. 


First, it provides an approach to semantically enhance LaTeX expressions with sufficient semantic
information for translations into CAS syntaxes. This so-called semantification process analyzes
the structure of the formula and its textual context to infer semantic information. The
research for this semantification process additionally contributes towards related Mathematical
Information Retrieval (MathIR) tasks, such as mathematical education assistance, math
recommendation and question answering systems, search engines, automatic plagiarism detection,
and math type assistance systems.


Second, this thesis demonstrates the first context-aware LaTeX to CAS translation framework,
LaCASt. LaCASt uses the developed semantification approach to transform LaTeX expressions
into an intermediate semantic LaTeX format, which is then further translated to CAS based
on translation patterns. These patterns were manually crafted by mathematicians to ensure
accurate and reliable translations. For comparison, this thesis additionally elaborates a
non-context-aware neural machine translation approach trained on a mathematical library generated
by Mathematica.


Third, the thesis provides a novel approach to evaluate the performance of LaTeX to CAS
translations on large-scale datasets with an automatic verification of equations in digital
mathematical libraries. This evaluation approach is based on the assumption that equations in digital
mathematical libraries can be computationally verified by CAS if a translation between both
systems exists. In addition, the thesis provides an in-depth manual evaluation on mathematical
articles from the English Wikipedia.


The presented context-aware translation framework LaCASt increases the efficiency and reliability
of translations to CAS. Via LaCASt, we strengthened the Digital Library of Mathematical Functions
(DLMF) by identifying numerous issues, from missing or wrong semantic annotations to sign
errors. Further, via LaCASt, we were able to discover several issues with the commercial CAS
Maple and Mathematica. The fundamental approaches to semantically enhance mathematics
developed in this thesis additionally contributed towards several related MathIR tasks. For
instance, the large-scale analysis of mathematical notations and the studies on math embeddings
motivated new approaches for math plagiarism detection systems, search engines, and
typing assistance for mathematical inputs. Finally, LaCASt translations will have a direct
real-world impact, as they are scheduled to be integrated into upcoming versions of the DLMF and
Wikipedia.




Summary (Zusammenfassung)


This dissertation addresses the problem of translating mathematical formulae between LaTeX
and Computer Algebra Systems (CAS). In the course of the digital age, LaTeX has become the
de-facto standard for writing mathematical formulae on the computer, especially in the
disciplines of Science, Technology, Engineering, and Mathematics (STEM). Since scientists
generally publish their work, LaTeX has become an integral part of modern research. Likewise,
scientists rely more and more on the capabilities of modern CAS to work effectively with
mathematical formulae, for example, by transforming, solving, or visualizing them. However,
current approaches that allow a translation from LaTeX to CAS, such as the internal import
functions of some CAS, are often limited to simple arithmetic expressions and are therefore of
little help in everyday practice. Consequently, the work of modern scientists in the STEM
disciplines is often characterized by time-consuming and error-prone manual translations
between LaTeX and CAS.


This dissertation makes the following contributions to solve the problem of translating
mathematical expressions between LaTeX and CAS.


First, LaTeX is a format that encodes only the visual presentation of mathematical expressions,
not their semantic information. This semantic information is, however, required by CAS, which
do not permit ambiguous input. As a first step towards a translation, this thesis therefore
introduces a so-called semantification of mathematical expressions. The semantification extracts
semantic information from the context and the components of a formula in order to draw
conclusions about its meaning. Since semantification is a classic task in the field of
Mathematical Information Retrieval, this part of the dissertation also contributes to related
topics. The approaches presented here are thus also useful for educational software, question
answering systems, search engines, and digital plagiarism detection.


As a second contribution, this dissertation presents the first context-aware LaTeX to CAS
translation program, called LaCASt. LaCASt uses the previously introduced semantification to
transform LaTeX into an intermediate format that represents the semantic information explicitly.
This format is called semantic LaTeX because it is a technical extension of LaTeX. The further
translation to CAS is realized through heuristic translation patterns for mathematical
functions. These translation patterns were defined in collaboration with mathematicians to
ensure a correct translation in this final step. To better understand the benefits of a
context-aware translation, this thesis also presents, for comparison, a neural machine
translation approach that does not take the context of a formula into account.


The third contribution of this dissertation introduces a new method for evaluating mathematical
translations, which makes it possible to check even a large number of translations for
correctness. This method follows the premise that equations in mathematical libraries should
remain correct after being translated into a CAS. If this is not the case, then either the
source equation, the translation, or the CAS is faulty. Note that each identified source of
error provides added value for the respective system. In addition to this automatic evaluation,
a manual analysis of translations is carried out on the basis of English Wikipedia articles.


In summary, the context-aware translation program LaCASt enables a more efficient way of
working with CAS. With the help of these translations, several problems in the Digital Library
of Mathematical Functions (DLMF), such as wrong information or sign errors, as well as errors in
the commercially distributed CAS Maple and Mathematica, could be automatically uncovered and
fixed.


The fundamental research presented here on semantically enriching mathematical expressions has,
moreover, made numerous contributions to related research topics. For example, the analysis of
the distribution of mathematical notations in large datasets has enabled new approaches in
digital plagiarism detection. Furthermore, work is currently underway to integrate the
translations of LaCASt into upcoming versions of Wikipedia and the DLMF.


Chapter 1
Introduction


This thesis addresses the issue of translating mathematical expressions from LaTeX to the syntax
of Computer Algebra Systems (CAS), which is typically a time-consuming and error-prone task
in the modern life of many researchers. A reliable and comprehensive translation approach
requires analyzing the textual context of mathematical formulae. In turn, research advances
in translating LaTeX contribute directly towards related tasks in the Mathematical Information
Retrieval (MathIR) arena. In this chapter, I provide an introduction to the topic. Section 1.1
introduces my motivation and provides an overview of the problem. Section 1.2 summarizes
the research gap. In Section 1.3, I define the research objective and research tasks of this thesis.
Section 1.4 concludes with an outline of the thesis, including an overview of the publications
that contributed to the goals of this thesis and the research path that led to these publications.


1.1 Motivation & Problem 


Consider a researcher who is working on Jacobi polynomials and examines the existing English
Wikipedia article about the topic¹. While she might be familiar with the Digital Library of
Mathematical Functions (DLMF) [98], a standard resource for Orthogonal Polynomials and
Special Functions (OPSF), equation (1.1) from the article might be new to her:


\[
P_n^{(\alpha,\beta)}(z) = \frac{\Gamma(\alpha+n+1)}{n!\,\Gamma(\alpha+\beta+n+1)} \sum_{m=0}^{n} \binom{n}{m} \frac{\Gamma(\alpha+\beta+n+m+1)}{\Gamma(\alpha+m+1)} \left(\frac{z-1}{2}\right)^{m} \tag{1.1}
\]


In order to analyze this new equation, e.g., to validate it, she wants to use CAS. CAS are 
powerful mathematical software tools with numerous applications [207]. Today’s most widely 


¹ https://en.wikipedia.org/wiki/Jacobi_polynomials [accessed 2021-10-01].
Hereafter, dates follow the ISO 8601 standard, i.e., YYYY-MM-DD.
Table 1.1: Different representations of a Jacobi polynomial.


System | Representation
Rendered version | P_n^{(α,β)}(cos(aΘ))
Generic LaTeX | P_n^{(\alpha,\beta)}(\cos(a\Theta))
Semantic LaTeX | \JacobipolyP{\alpha}{\beta}{n}@{\cos@{a\Theta}}
Maple [36] | JacobiP(n, alpha, beta, cos(a*Theta))
Mathematica [393] | JacobiP[n, \[Alpha], \[Beta], Cos[a \[CapitalTheta]]]
SymPy [252] | jacobi(n, Symbol('alpha'), Symbol('beta'), cos(a*Symbol('Theta')))


used CAS include Maple [36], Mathematica [393], and MATLAB [246]. Scientists use CAS² to
simplify, manipulate, evaluate, compute, or even visualize mathematical expressions. Thus,
CAS play a crucial role in the modern era for pure and applied mathematics [8, 184, 207, 262]
and even found their way into classrooms [237, 363, 365, 389, 390]. In turn, CAS are the perfect
tool for the researcher in our example to examine the formula further. In order to use a CAS,
she needs to translate the expression into the correct CAS syntax.
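What "validating" such an equation computationally can look like is sketched below without any CAS. The sketch assumes the sum representation of equation (1.1) as given in the Wikipedia article, restricts itself to real parameters, and compares the special case α = β = 0, n = 2 against the Legendre polynomial P_2(z) = (3z² − 1)/2. The function names are illustrative only.

```python
import math


def jacobi_p(n: int, alpha: float, beta: float, z: float) -> float:
    """Evaluate the sum representation of Eq. (1.1) for real alpha, beta > -1."""
    prefactor = math.gamma(alpha + n + 1) / (
        math.factorial(n) * math.gamma(alpha + beta + n + 1)
    )
    total = 0.0
    for m in range(n + 1):
        total += (
            math.comb(n, m)
            * math.gamma(alpha + beta + n + m + 1)
            / math.gamma(alpha + m + 1)
            * ((z - 1) / 2) ** m
        )
    return prefactor * total


def legendre_p2(z: float) -> float:
    """Closed form of the Legendre polynomial of degree 2."""
    return (3 * z * z - 1) / 2


# For alpha = beta = 0 the Jacobi polynomial reduces to the Legendre polynomial.
for z in (-0.75, 0.0, 0.5, 1.0):
    assert math.isclose(jacobi_p(2, 0.0, 0.0, z), legendre_p2(z), abs_tol=1e-12)
```

A check like this only covers a handful of sample points and one special case; it indicates that the formula was transcribed consistently, not that it holds in general.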


Table 1.1 illustrates the differences between computable and presentational encodings for a
Jacobi polynomial. While the rendered version and the LaTeX [220] encoding only provide
visual information, semantic LaTeX [403] and the CAS encodings explicitly encode the meaning,
i.e., the semantics, of the formula. On the one hand, LaTeX³ has become the de-facto standard
to typeset mathematics in scientific publications [129, 248, 402], especially in the domain of
Science, Technology, Engineering, and Mathematics (STEM). On the other hand, computational
advances make CAS an essential asset in the modern workflow of experimenting and publishing
in the Sciences. Translating expressions between LaTeX and CAS syntaxes is, therefore, a
typical task in the everyday life of our hypothetical researcher. Despite this common need, no
reliable translation from a presentational format, such as LaTeX, to a computable format, such as
Mathematica, is available to date. The only option our hypothetical researcher has is to manually
translate the expression into the specific syntax of a CAS. This process is time-consuming and
often error-prone.


Problem: No reliable translation from a presentational mathematical format to a
computable mathematical format exists to date.


If a translation between LaTeX and CAS is so essential, why are there no translation tools
available? As is often the case in research, the reasons for this are diverse. First, there are
translation approaches available. Some CAS, such as Mathematica and SymPy, allow importing
LaTeX expressions. Most CAS support at least the Mathematical Markup Language (MathML),
since it is the current web standard to encode mathematical formulae. With numerous tools
available to transfer LaTeX to MathML [18], a translation from LaTeX to CAS syntaxes should
not be a difficult task. However, none of these available translation techniques are reliable


² In the sequel, the acronym CAS is used interchangeably with its plural.
³ https://www.latex-project.org/ [accessed 2021-10-01]
Table 1.2: Examples of Mathematica’s LaTeX import function ToExpression["x", TeXForm].
Tested with Mathematica [393] v.12.1.1. The second sum (row 8, marked with ?) is only
partially correct. Since the second summand contains the summation index n, the second
summand should be part of the sum.


LaTeX | Rendering | Import | Result
\int_a^b x dx | ∫_a^b x dx | Error | ✗
\int_a^b x \mathrm{d}x | ∫_a^b x dx | Error | ✗
\int_a^b x\, dx | ∫_a^b x dx | Integrate[x, {x, a, b}] | ✓
\int_a^b x\; dx | ∫_a^b x dx | Error | ✗
\int_a^b x\, \mathrm{d}x | ∫_a^b x dx | Error | ✗
\int_a^b \frac{dx}{x} | ∫_a^b dx/x | Error | ✗
\sum_{n=0}^N n^2 | Σ_{n=0}^N n^2 | Sum[n^2, {n, 0, N}] | ✓
\sum_{n=0}^N n^2 + n | Σ_{n=0}^N n^2 + n | Sum[n^2, {n, 0, N}] + n | ?
{n \choose m} | (n choose m) | Error | ✗
\binom{n}{m} | (n choose m) | Binomial[n, m] | ✓


and comprehensive. Table 1.2 illustrates how Mathematica, one of the major proprietary CAS,
fails to import even simple formulae. Another option is SnuggleTeX [251], a LaTeX to MathML
converter which also supports translations to Maxima [324]. SnuggleTeX fails to translate all
expressions in Table 1.2. Alternative translations via MathML as an intermediate format perform
similarly (as we will show later in Section 2.3).
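The inflexibility visible in the integral rows of Table 1.2 is largely a matter of presentational variants (spacing macros, an upright differential). The following minimal sketch shows how a normalization step before parsing can make these variants equivalent. It is an illustration with hypothetical helper names, not the import logic of Mathematica or SnuggleTeX, and it deliberately handles only single-letter bounds and a trailing differential.

```python
import re

# Hypothetical pattern for integrals of the form \int_a^b <integrand> d<var>.
_INTEGRAL = re.compile(r"\\int_(\w)\^(\w)\s+(.+?)\s*d(\w)$")


def normalize(tex: str) -> str:
    """Remove purely presentational differences before parsing."""
    tex = tex.replace(r"\mathrm{d}", "d")    # upright differential -> plain d
    tex = re.sub(r"\\[,;:!]", " ", tex)      # spacing macros \, \; \: \! -> blank
    return re.sub(r"\s+", " ", tex).strip()  # collapse repeated whitespace


def integral_to_mathematica(tex: str) -> str:
    """Translate a simple definite integral to Mathematica's Integrate syntax."""
    match = _INTEGRAL.match(normalize(tex))
    if match is None:
        raise ValueError(f"unsupported expression: {tex}")
    lower, upper, integrand, var = match.groups()
    return f"Integrate[{integrand}, {{{var}, {lower}, {upper}}}]"


variants = [
    r"\int_a^b x dx",
    r"\int_a^b x \mathrm{d}x",
    r"\int_a^b x\, dx",
    r"\int_a^b x\; dx",
    r"\int_a^b x\, \mathrm{d}x",
]
for tex in variants:
    print(integral_to_mathematica(tex))  # Integrate[x, {x, a, b}] for every variant
```

The sixth integral of Table 1.2, where the differential sits inside a fraction, is intentionally rejected by this sketch; handling it requires structural parsing rather than string normalization.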


While the simple cases shown in Table 1.2 could be solved with a more comprehensive and
flexible parser and mapping strategy, such a solution would ignore the real challenge of translating
mathematics to CAS: ambiguity. The interpretation of the majority of mathematical
expressions is context-dependent, i.e., the same formula may refer to different concepts in different
contexts. Take the expression π(x + y) as an example. In number theory, the expression most
likely refers to the number of primes less than or equal to x + y. In another context, however,
it may just refer to a multiplication πx + πy. Without considering the context, an appropriate
translation of this ambiguous expression is infeasible. Today’s translation solutions, however,
do not consider the context of an input. Instead, they translate the expression based on internal
decisions, which are often not transparent to a user.


Table 1.3 shows the results of importing π(x + y) to different CAS. Each CAS in Table 1.3
interprets π as a function call but does not associate it with the prime counting function (nor
any other predefined function). Only SnuggleTeX translated π as the mathematical constant
to Maxima syntax. However, Maxima does not contain a prime counting function. The CAS
import functions consider the expression as a generic function with the name π. Mathematica,
surprisingly, still links π with the mathematical constant, which results in peculiar behaviour
for numeric evaluations. The expression N[Pi[x+y]] (numeric evaluation of the imported
expression) is evaluated to 3.14159[x + y]. Associating the variables x and y with numbers,
say x, y = 1, would result in the rather odd expression 3.14159[2].
Table 1.3: The results of importing π(x + y) in different CAS. For Maple, a MathML
representation was used. Content MathML was not tested, since there is no content dictionary
available that defines the prime counting function. SnuggleTeX translated the expression to
the CAS Maxima. The two rightmost columns show the expected expressions in the context
of the prime counting function or a multiplication. None of the CAS chooses either of the two
expected interpretations. Note that the prime counting function in Maple can also be written
with pi(x+y) and requires pre-loading the extra package NumberTheory. Nonetheless, this
function pi(x+y) is still different from the actual imported expression Pi(x+y). Note further
that Maxima does not define a prime counting function.


System | Translated expression | Expected: number of primes | Expected: multiplication
Maple [36] v.2019 | Pi(x+y) | PrimeCounting(x+y) | Pi*(x+y)
Mathematica [393] v.12.1.1 | Pi[x+y] | NPrimes[x+y] | Pi*(x+y)
SymPy [252] v.1.8 | pi(x+y) | primepi(x+y) | pi*(x+y)
SnuggleTeX [251] v.1.2.2 | %pi*(x+y) | - | %pi*(x+y)


Why do existing translation techniques not allow specifying a context? Mainly because it
is an open research question what this context is or needs to be. The exact information
needed to perform translations to CAS syntaxes, and where to find it, is unclear [11]. Some
required information is indeed encoded in the structure of the expression itself. Consider a
simple fraction such as 1/2. This expression is context-independent and can be directly
translated. The expression P_n^{(α,β)}(x) in the context of OPSF is also often unambiguous for
general-purpose CAS. Since Mathematica supports no other formula with this presentational
structure, i.e., P followed by a subscript and a superscript with parentheses, Mathematica is able
to correctly associate P_•^{(•,•)}(•), where the • are wildcards, with the function JacobiP. In other
cases, the immediate textual context of the formula provides sufficient information to disambiguate
the expression [54, 329]. Consider an author who explicitly declares π(x) as the prime counting
function right before she uses it with π(x + y). In this case, it might be sufficient to scan the
surrounding context for key phrases [183, 214, 329], like ‘prime counting function’, in order to
map π to, for instance, NPrimes in Mathematica.
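The key-phrase idea can be illustrated with a deliberately naive sketch. The translation strings below are the Maple and SymPy entries of Table 1.3; the lookup function, the phrase list, and the fallback to the mathematical constant are assumptions made for this illustration and do not describe the semantification process developed in this thesis.

```python
# Target expressions as listed in Table 1.3; the lookup itself is hypothetical.
PRIME_COUNTING = {"Maple": "PrimeCounting({arg})", "SymPy": "primepi({arg})"}
MULTIPLICATION = {"Maple": "Pi*({arg})", "SymPy": "pi*({arg})"}
KEY_PHRASES = ("prime counting function", "number of primes")


def translate_pi_call(arg: str, context: str, cas: str) -> str:
    """Pick a reading of pi(arg) by scanning the textual context for key phrases."""
    text = context.lower()
    if any(phrase in text for phrase in KEY_PHRASES):
        return PRIME_COUNTING[cas].format(arg=arg)
    # Assumed default: without a declaration, pi is read as the constant.
    return MULTIPLICATION[cas].format(arg=arg)


number_theory = "Let pi(x) denote the prime counting function."
geometry = "The circumference grows linearly with the radius."

print(translate_pi_call("x+y", number_theory, "SymPy"))  # primepi(x+y)
print(translate_pi_call("x+y", geometry, "Maple"))       # Pi*(x+y)
```

Such a scan only works when the declaration is present and close to the formula, which is precisely the limitation discussed next.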


Often, the semantic explanations of mathematical objects in an article are scattered around in
the context or absent entirely [394]. An interested reader needs to retrieve sufficient semantic
explanations and correctly link them with mathematical objects in order to comprehend
the meaning of a complex formula. Sometimes, an author presumes the interpretation of an
expression can be considered as common knowledge and, therefore, does not require further
explanations. Consider the case where π(x + y) refers to a multiplication between π and (x + y).
In general, an author may consider π (the mathematical constant) as common knowledge and does not
explicitly declare its meaning. The same could be true for scientific articles, where the length is
often limited. An article about prime numbers will probably not explicitly declare the meaning of
π(x + y) because the author presumes the semantics are unambiguous given the overall context
of the article.
In other cases, the information needs go beyond a simple text analysis. Consider π(x + y)
as a generic function that was previously defined in the article and simply has no name. An
appropriate translation would require retrieving the definition of the function from the context.
But even if a function is well-known and supported by a CAS, a direct translation might be
inappropriate because the definition in the CAS is not what our researcher expected [3, 13].
Legendre’s incomplete elliptic integral of the first kind F(φ, k), for example, is defined with
the amplitude φ as its first argument in the DLMF and Mathematica. In Maple, however, one
needs to use the sine of the amplitude sin(φ) for the first argument⁴. In turn, an appropriate
translation to Maple might be EllipticF(sin(phi), k) rather than EllipticF(phi, k)
depending on the source of the original expression. The English Wikipedia article about elliptic
integrals⁵ contains both versions and refers to them with F(φ, k) and F(x; k) respectively.
Even though both versions in Wikipedia refer to the same function, correct translations to
Maple of F(φ, k) and F(x; k) are not the same.
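A translation rule therefore has to know which convention the source expression follows. The sketch below encodes only what is stated above for Maple: an amplitude φ as first argument must be wrapped in sin(...), whereas the form F(x; k) already matches Maple's convention. The function name and the boolean flag are hypothetical.

```python
def elliptic_f_to_maple(first_arg: str, modulus: str, first_arg_is_amplitude: bool) -> str:
    """Translate Legendre's incomplete elliptic integral of the first kind to Maple.

    first_arg_is_amplitude=True  corresponds to the form F(phi, k),
    first_arg_is_amplitude=False corresponds to the form F(x; k).
    """
    if first_arg_is_amplitude:
        # Maple expects the sine of the amplitude as its first argument.
        return f"EllipticF(sin({first_arg}), {modulus})"
    return f"EllipticF({first_arg}, {modulus})"


print(elliptic_f_to_maple("phi", "k", first_arg_is_amplitude=True))   # EllipticF(sin(phi), k)
print(elliptic_f_to_maple("x", "k", first_arg_is_amplitude=False))    # EllipticF(x, k)
```

The flag cannot be derived from the symbol F alone; it has to come from the source of the expression or from its surrounding context.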


In cases of multi-valued functions, translations between different systems can become
eminently more complex [83, 91, 172]. Even for simple cases, such as the arccotangent
function arccot(x), the behavior of different CAS might be confusing. For example, since
arccot(x) is multi-valued, there are multiple solutions of arccot(−1). CAS, like any general
calculator, only compute values on the principal branches and, therefore, return
only a single value. The principal branches, however, are not necessarily uniformly
positioned among multiple systems [84, 172]. In turn, the returned value of a multi-valued
function may depend on the system, see Table 1.4. A translation of arccot(x) from the
DLMF to arccot(x) in Maple would be only correct for ℜx > 0. Finally, CAS may also compute
irrational looking expressions without objections, e.g., arccot(1/0) returns 1.5708 in MATLAB⁶.
Even for field experts, it can be challenging to keep track of every property and characteristic
of CAS [20, 100].


Table 1.4: Different computation results for arccot(−1) (inspired by [84]).


System or Source | arccot(−1)
[276] 1st printing | 3π/4
[276] 9th printing | −π/4
Maple [36] v.2020.2 | 3π/4
Mathematica [393] v.12.1.1 | −π/4
SymPy [252] v.1.5.1 | −π/4
Axiom [173] v.Aug.2014 | 3π/4
Reduce [151] v.5865 | 3π/4
MATLAB [246] vR2021a | −π/4
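The two values that appear in Table 1.4 can be reproduced with two common closed-form definitions of the arccotangent. The sketch below uses only Python's standard library; which definition a particular CAS implements is not asserted here, and the function names are illustrative.

```python
import math


def arccot_via_reciprocal(x: float) -> float:
    """arccot(x) = arctan(1/x) for x != 0; odd in x, discontinuous at x = 0."""
    return math.atan(1.0 / x)


def arccot_via_complement(x: float) -> float:
    """arccot(x) = pi/2 - arctan(x); continuous, with values in (0, pi)."""
    return math.pi / 2 - math.atan(x)


# Both definitions agree for positive arguments ...
for x in (0.5, 1.0, 2.0):
    assert math.isclose(arccot_via_reciprocal(x), arccot_via_complement(x))

# ... but differ by pi for negative ones.
print(arccot_via_reciprocal(-1.0) / math.pi)  # approximately -0.25, i.e. -pi/4
print(arccot_via_complement(-1.0) / math.pi)  # approximately  0.75, i.e. 3*pi/4
```

A translation that maps one definition onto the other without a correction term is therefore only valid on the half-line where both agree.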


Problem: Existing LaTeX to CAS converters are context-agnostic, inflexible, limited
to simple expressions, and nontransparent.


In combination, all of the issues underline that an accurate manual translation to the syntax of 
CAS is challenging, time-consuming, error-prone, and requires deep and substantial knowledge 
about the target system. Especially with the increasing complexity of the translated expressions, 
errors during the translation process might be inevitable. Real-world scenarios often include 


⁴ https://www.maplesoft.com/support/help/maple/view.aspx?path=EllipticF
[accessed 2021-10-01]

⁵ https://en.wikipedia.org/wiki/Elliptic_integral [accessed 2021-10-01]

⁶ MATLAB evaluates 1/0 to infinity and the limit in positive infinity of the arccotangent function is π/2 (or roughly
1.5708). Yet, the interpretation of the division by zero is not wrong, since it follows the official IEEE 754 standard
for floating-point arithmetic [170].
much more complicated formulae compared to the expressions in Table 1.2 or even equation (1.1).
Moreover, if an error occurs, the cause of the error can be very challenging to detect and trace
back to its origin. The issue of translating arccot(x) to Maple, for example, may remain
undiscovered until a user calculates negative values. If the function is embedded into a more
complex equation, even experts can lose track of potential issues. In combination with unreliable
translation tools, working with CAS may even be frustrating. Mathematica, for example, is able
to import our test expression (1.1) mentioned earlier without throwing an error⁷. However,
investigating the imported expression reveals an incorrect translation due to an issue with
factorials. To productively work with CAS, our hypothetical researcher from above needs to
carefully evaluate if the automatically imported expression was correct. As a consequence,
existing translation approaches are not practically useful.


In this thesis, I will focus on discovering the information needs to perform correct translations
from presentational formats, here mainly LaTeX, to computational formats, here mainly CAS
syntaxes. My personal motivation is to improve the workflow of researchers by providing them with a
reliable translation tool that offers crucial additional information about the translation process.
Further, I limit the support of such a translation tool to general-purpose CAS, since many
general mathematical expressions simply cannot be translated to appropriate CAS expressions
for task-specific CAS (or other mathematical software, such as theorem provers). The focus on
general-purpose CAS allows me to provide a broad solution to a general audience. Note further
that, in this thesis, I mostly focus on the two major CAS, Maple and Mathematica. However,
the goal is to provide a translation tool that is easy to extend to support more CAS.


Further, the real-world applications of such a translation tool go far beyond an improved work- 
flow with CAS. A computable formula can be automatically verified with CAS [51, 52, 2, 
8, 13, 153, 184, 414, 415], translated to other semantically enhanced formats, such as Open- 
Math [53, 57, 119, 152, 303, 361], content MathML [59, 60, 159, 270, 318, 342] or other CAS 
syntaxes [110, 361], imported to theorem provers [35, 57, 152, 163, 338, 375], or embedded in
interactive documents [85, 131, 150, 162, 201, 284]. Since an appropriate translation is generally
context-dependent, a translator must use MathIR [141] techniques to access sufficient semantic
information. Hence, advances in translating LaTeX to CAS syntaxes also contribute directly
towards related MathIR tasks, including entity linking [150, 208, 212, 316, 319, 321, 322], math 
search engines [92, 181, 182, 203, 211, 236, 274], semantic tagging of math formulae [71, 402], 
recommendation systems [30, 31, 50, 319], type assistance systems [103, 106, 14, 321, 400], and 
even plagiarism detection platforms [253, 254, 334]. 


1.2 Research Gap 


Existing translation approaches from presentational formats to computable formats share the 
same issues. Currently, these translation approaches are 


1. context-independent, i.e., a translation of an expression is unique regardless of the context
the expression came from (see the π(x + y) example mentioned earlier);


2. nontransparent, i.e., the internal translation decisions are not communicated to the user,
which makes the translation untrustworthy and errors challenging to trace or detect;


⁷ If the binomial is given with the \binom macro rather than \choose.
3. inflexible, i.e., slight changes in the notation can cause the translation to fail (see the
integral imports from Table 1.2); and


4. limited to simple expressions due to missing mappings between function definition sources,
i.e., even with semantic information, a translation often fails.


Issue 4 arises despite the fact that there are semantically enhanced data formats that have been
specifically developed to make expressions between CAS interchangeable, such as
OpenMath [119, 303, 361] and content MathML [318, 343]. Nonetheless, most CAS do not support
OpenMath natively [303] and the support for content MathML is limited to school
mathematics [318]. The reason is that such a translation requires a database that maps functions between
different semantic sources. As discussed above, creating such a comprehensive database can be
time-consuming due to slight differences between the systems (e.g., positions of branch cuts,
different supported domains, etc.) [361]. Hence, for economic reasons, crafting and maintaining
such a library is unreasonable. Translations between semantically enhanced formats, e.g., between
CAS syntaxes, OpenMath, or content MathML, are consequently often unreliable.


In previous research, I focused on issues 2-4 by developing a rule-based LaTeX to
CAS translator, called LaCASt. Originally, LaCASt performed translations from semantic LaTeX to
Maple. Relying on semantic LaTeX allows LaCASt to largely ignore the ambiguity issue 1 and
focus on the other problems. For this thesis, I continued to develop LaCASt to further mitigate
the limitation and inflexibility issues 3 and 4. Further, I focused on extending LaCASt to become
the first context-aware translator to tackle the context-independency issue 1.
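To make the notion of a rule-based translation from a semantic macro to a CAS more concrete, the following sketch maps the semantic macro from Table 1.1 to the Maple call shown there. The placeholder syntax ($0, $1, ...), the dictionaries, and the regular expression are assumptions chosen for this illustration; they are not the pattern format of the actual translator, and nested macros in the arguments are not handled.

```python
import re

# Hypothetical pattern table: $i refers to the i-th argument of the semantic macro.
PATTERNS = {"JacobipolyP": {"Maple": "JacobiP($2, $0, $1, $3)"}}
# Hypothetical symbol table for Greek letters in the target CAS.
SYMBOLS = {"Maple": {r"\alpha": "alpha", r"\beta": "beta"}}

# Matches \Macro{a}{b}{c}@{d} with brace-free arguments.
MACRO_CALL = re.compile(r"\\(\w+)\{([^{}]*)\}\{([^{}]*)\}\{([^{}]*)\}@\{([^{}]*)\}$")


def translate(semantic_tex: str, cas: str) -> str:
    """Fill the translation pattern of a semantic macro with its arguments."""
    match = MACRO_CALL.match(semantic_tex)
    if match is None or match.group(1) not in PATTERNS:
        raise ValueError(f"no translation pattern for: {semantic_tex}")
    name, *args = match.groups()
    for tex_name, cas_name in SYMBOLS[cas].items():
        args = [arg.replace(tex_name, cas_name) for arg in args]
    pattern = PATTERNS[name][cas]
    return re.sub(r"\$(\d)", lambda m: args[int(m.group(1))], pattern)


print(translate(r"\JacobipolyP{\alpha}{\beta}{n}@{x}", "Maple"))
# JacobiP(n, alpha, beta, x)
```

Because the macro name already fixes the meaning, the rule only has to reorder arguments and rename symbols; the ambiguity is resolved before this step.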


1.3 Research Objective 


This doctoral thesis aims to: 


Research Objective


Develop and evaluate an automated context-sensitive process that makes presentational 
mathematical expressions computable via computer algebra systems. 


Hereafter, I consider the semantic information of a mathematical expression as sufficient if a 
translation of the expression into the syntax of a CAS becomes feasible. To achieve the research 
objective, I define the following five research tasks: 


Research Tasks


I Analyze the strengths and weaknesses of existing semantification approaches for 
translating mathematical expressions to computable formats. 


II Develop a semantification process that will improve on the weaknesses of current 
approaches. 


III Implement a system for the automated semantification of mathematical expressions 
in scientific documents. 


IV Implement an extension of the system to provide translations to computer algebra 
systems. 


V Evaluate the effectiveness of the developed semantification and translation system. 
1.4 Thesis Outline


Chapter 1 provides an introduction for translating presentational mathematical expressions 
into computable formats. The chapter further defines the research gap for such translations and 
defines the research objective and tasks this thesis addresses. Finally, it outlines the structure 
of the thesis and briefly summarizes the main publications. 


Chapter 2 provides an overview of related work by examining existing mathematical formats
and translation approaches between them. This chapter focuses on Research Task I by
analyzing the strengths and weaknesses of existing translation approaches with the main focus on
the standard formats LaTeX and MathML.


Chapter 3 addresses Research Task II by studying the capability of math embeddings,
introducing a new concept to describe the nested structure of mathematical objects, and presenting
a novel context-sensitive semantification process for LaTeX expressions.


Chapter 4 presents the first context-sensitive LaTeX to CAS translator: LaCASt. In particular, this
chapter focuses on Research Tasks III and IV by implementing the previously introduced
semantification process and integrating it into the rule-based semantic LaTeX to CAS translator
LaCASt. In addition, the chapter briefly summarizes a context-independent neural machine
translation approach to estimate how much structural information is encoded in mathematical
expressions.


Chapter 5 evaluates the new translation tool LaCASt and, therefore, contributes mainly towards
Research Task V. In particular, it introduces the novel evaluation concept of equation
verifications to estimate the appropriateness of translated CAS expressions. Our new evaluation
concept not only detects issues in the translation pipeline but is also able to identify errors
in the source equation, e.g., from the DLMF or Wikipedia, and the target CAS, e.g., Maple or
Mathematica. In order to maximize the number of verifiable DLMF equations via our novel
evaluation technique, this chapter also introduces some heuristic extensions to the LaCASt pipeline.
Hence, this chapter partially contributes to Research Task IV too.


Chapter 6 concludes the thesis by summarizing contributions and their impact on the MathIR 
community. It further provides a brief overview of the remaining issues and future work. 


An Appendix is available in the electronic supplementary material and provides additional
information about certain aspects of this thesis, including an extended error analysis, result
tables, and a summary of bugs and issues we discovered with the help of LaCASt in the DLMF,
Maple, Mathematica, and Wikipedia.


1.4.1 Publications 


Most parts of this thesis were published in international peer-reviewed conferences and journals.
Table 1.5 provides an overview of the publications that are reused in this thesis. The first column
identifies the chapter a publication contributed to. The venue rating was taken from the Core
ranking for conferences and the Scimago Journal Rank (SJR) for journal articles. Each rank
was retrieved for the year of publication (or the year of submission, in case the paper has not been
published yet). Table 1.6 similarly shows publications that partially contributed towards the goal
of this thesis but are not reused within a chapter. Note that the publication [3] (in Table 1.6) was
part of my Master's thesis and contributed towards this doctoral thesis as a preliminary project.
The journal publication [13] (also in Table 1.6) is an extended version of the Master's thesis and
the mentioned article [3], updated with new results. The venue abbreviations in both tables are
explained in the glossary. Lastly, note that the TPAMI journal article [11] is reused in Chapter 4
(for the methodology) and in Chapter 5 (for the evaluation) to provide a coherent structure. My
publications, talks, and submissions are separated from the general bibliography in the back
matter and can be found on page 171.


Core ranking: http://portal.core.edu.au/conf-ranks/ with the ranks A* - flagship conference (top 5%),
A - excellent conference (top 15%), B - good conference (top 27%), and C - remaining conferences
[accessed 2021-10-01].

SJR: https://www.scimagojr.com/ with the ranks Q1 - Q4, where Q1 refers to the best 25% of journals
in the field, Q2 to the second best quarter, and so on [accessed 2021-10-01].


Table 1.5: Overview of the primary publications in this thesis.


Ch. | Venue          | Year | Type       | Length | Author Position | Venue Rating | Ref.
2   | SIGIR          | 2019 | Workshop   | Full   | 1 of 6          | Core A*      | [9]
    | JCDL           | 2018 | Conference | Full   | 2 of 6          | Core A*      | [18]
    | Scientometrics | 2020 | Journal    | Full   | 1 of 7          | SJR Q1       | [15]
3   | WWW            | 2020 | Conference | Full   | 1 of 7          | Core A*      | [14]
    | ICMS           | 2020 | Conference | Full   | 1 of 4          | n/a          | [10]
4   | TPAMI          | 2021 | Journal    | Full   | 1 of 6          | SJR Q1       | [11]
5   | TACAS          | 2021 | Conference | Full   | 1 of 8          | Core A       | [8]
    | CICM           | 2018 | Conference | Full   | 2 of 3          | n/a          | [2]
6   | JCDL           | 2020 | Conference | Poster | 2 of 5          | Core A*      | [17]

Table 1.6: Overview of secondary publications that partially contributed to this thesis.


Year | Venue | Type       | Length | Author Position | Venue Rating | Ref.
2020 | CLEF  | Workshop   | Full   | 4 of 6          | n/a          | [16]
     | EMNLP | Workshop   | Full   | 2 of 4          | Core A       | [1]
2019 | AJIM  | Journal    | Full   | 1 of 4          | SJR Q1       | [13]
2018 | CICM  | Conference | Short  | 1 of 4          | n/a          | [12]
2017 | CICM  | Conference | Full   | 4 of 9          | n/a          | [3]

1.4.2 Research Path 


This section provides a brief overview of my research path that led to this thesis, i.e., it discusses
the primary publications and the motivations behind them. Every publication is marked with
the associated chapter and a reference. This research path is logically (not chronologically)
divided into three sections: preliminary work, the semantification of LaTeX, and the evaluation
of translations.


Preliminary Work I had the first contact with the problem of translating LaTeX to CAS
syntaxes during my undergraduate studies in mathematics. During that time, I regularly used
CAS like MATLAB and SymPy for numeric simulations and for plotting results. At the same
time, we were required to hand in our homework as LaTeX files. While exporting content from the
CAS to LaTeX files was rather straightforward, the other way around, i.e., importing LaTeX into
the CAS, required manual conversions. I decided to explore the reasons for this shortcoming in
my Master's thesis. During that time, I developed the first version of a semantic LaTeX to CAS
translator, which was later coined LaCASt. The results from this first study were published at
the Conference on Intelligent Computer Mathematics (CICM) in 2017.


Note on [11]: The methodology part of this journal article is reused in Chapter 4, while the
evaluation part is reused in Chapter 5.


“Semantic Preserving Bijective Mappings of Mathematical Formulae Between
Document Preparation Systems and Computer Algebra Systems” by Howard
S. Cohl, Moritz Schubotz, Abdou Youssef, André Greiner-Petter, Jürgen
Gerhard, Bonita Saunders, Marjorie McClain, Joon Bang, and Kevin Chen. In:
Proceedings of the International Conference on Intelligent Computer Mathematics
(CICM), 2017.


Not Reused — [3] 


This first version of LaCASt focused specifically on the CAS Maple but was designed modularly
to allow later extensions to other CAS. The main limitation of LaCASt, however, was the
requirement of using semantic LaTeX macros to disambiguate mathematical expressions manually.
An automatic disambiguation process did not exist at the time. Moreover, only a few previous
projects focused on a semantification for translating mathematical formats. Hence, I continued
my research in this direction.


In the subsequent parts of this thesis, I will use 'we' rather than 'I', since none
of the presented contributions would have been possible without the fruitful discussions
with and the tremendous help from advisors, colleagues, students, and friends.


Semantification of LaTeX As an alternative for semantic LaTeX, we first closely investigated
existing converters for MathML (see Section 2.2.1). Since MathML was (and still is) the standard
encoding for mathematical expressions on the web, most CAS support MathML. MathML uses
two markups, presentation and content MathML. The former visualizes a formula, while the
latter describes the semantic content. Hence, content MathML can disambiguate math much
like semantic LaTeX. Since MathML is the official web standard and LaTeX the de-facto standard
for writing math, there are numerous converters available that translate LaTeX to MathML.
As our first contribution, we developed MathMLben, a benchmark dataset for measuring the
quality of MathML markup that appears in a textual context. With this benchmark, we evaluated
nine state-of-the-art LaTeX to MathML converters, including Mathematica as a major CAS. We
published our results at the Joint Conference on Digital Libraries (JCDL) in 2018.


“Improving the Representation and Conversion of Mathematical Formulae by
Considering their Textual Context” by Moritz Schubotz, André Greiner-Petter,
Philipp Scharpf, Norman Meuschke, Howard S. Cohl, and Bela Gipp.
In: Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries
(JCDL), 2018.


Chapter 2 — [18]


LaCASt: LaTeX to CAS Translator.


We discovered that three of the nine tools were able to generate content MathML but with
insufficient accuracy. None of the available tools was capable of analyzing the context of a
given formula. Hence, the converters were unable to infer the correct semantic information
for most of the symbols and functions. In our study, we proposed a manual semantification
approach that semantically enriches the translation process of existing converters by feeding
them semantic information from the surrounding context of a formula. We illustrated the
enrichment process manually via the converter LaTeXML, which allowed us to add custom semantic
macros to improve the generated MathML data. In fact, we used this manual approach to create
the entries of MathMLben in the first place.


Naturally, our next goal was to automatically retrieve semantic information from the context
of a given formula. Around this time, word embeddings [256] began to gain interest in the
MathIR community [121, 215, 242, 400, 404]. It seemed that vector representations were able to
capture some semantic properties of tokens in natural languages. Can we create such semantic
vector representations of mathematical expressions too? Unfortunately, we discovered that
the related work in this new area of interest did not discuss a crucial underlying issue with
embedding mathematical expressions. In math expressions, certain symbols or entire groups of
tokens are fixed, such as the red tokens in the Gamma function $\Gamma(x)$ or the Jacobi polynomial
$P_n^{(\alpha,\beta)}(x)$, while others may vary (gray). Inspired by words in natural languages, we
call these fixed tokens the stem of a mathematical object or operation. Unfortunately, in
mathematics, this stem is context-dependent. If $\pi$ is a function, the red tokens are its stem
in $\pi(x + y)$, i.e., $\pi$ together with the parentheses. However, if $\pi$ is not a function, the
stem in $\pi(x + y)$ is just the symbol $\pi$ itself. If we do not know the stem
of a mathematical object, how can we group them so that a trained model understands the
connection between variations like $\Gamma(z)$ and $\Gamma(x)$? The answer is: we cannot. The only
alternative is to use context-independent representations, e.g., we only embed the identifiers or
the entire expression. Each of these approaches has advantages and disadvantages. We shared
our discussion with the community at the BIRNDL Workshop at the conference on Research
and Development in Information Retrieval (SIGIR) in 2019.


“Why Machines Cannot Learn Mathematics, Yet” by André Greiner-Petter,
Terry Ruas, Moritz Schubotz, Akiko Aizawa, William I. Grosky, and Bela
Gipp. In: Proceedings of the 4th Joint Workshop on Bibliometric-Enhanced
Information Retrieval and Natural Language Processing for Digital Libraries
(BIRNDL@SIGIR), 2019.


Chapter 2 — [9] 


Nonetheless, context-independent math embeddings still have many valuable applications.
Search engines, for example, can profit from a vector representation that represents a
mathematical expression in a particular context. Such a trained model would still be unable to tell us
what the expression is, but it can tell us efficiently if the expression is semantically similar (e.g.,
because the surrounding text is similar) to another expression. Further, embedding semantic
LaTeX allows us to overcome the issue of unknown stems for most functions since the macro
unambiguously defines the stem. Youssef and Miller [404] trained such a model on the DLMF
formulae. Later, we published an extended version of our workshop paper together with Youssef
and Miller in the Scientometrics journal.


“Math-Word Embedding in Math Search and Semantic Extraction” by André
Greiner-Petter, Abdou Youssef, Terry Ruas, Bruce R. Miller, Moritz
Schubotz, Akiko Aizawa, and Bela Gipp. In: Scientometrics 125(3): 3017-3046,
2020.


Chapter 3 — [15] 


Unfortunately, this sets us back to the beginning, where we need manually crafted semantic
LaTeX. We started to investigate the issue of interpreting the semantics of mathematical
expressions from a different perspective. As we will see later in Section 2.2.4, humans tend to visualize
mathematical expressions in a tree structure, where operators, functions, or relations are parent
nodes of their components. Identifiers and other terminal symbols are the leaves of these trees.
The MathML tree data structure comes close to these so-called expression trees
but does not strictly follow the same idea [331]. The two aforementioned context-independent
approaches to embed mathematical expressions take either the leaves or the roots of such trees.
The subtrees in between are the context-dependent mathematical objects we need. Not all
subtrees, however, are meaningful, and the mentioned expression trees are only theoretical
interpretations. In searching for an approach to discover meaningful subexpressions, which we
call Mathematical Objects of Interest (MOI), we performed the first large-scale study of
mathematical notations on real-world scientific articles. In this study, we followed the assumption
that every subexpression with at least one identifier can be semantically important. Hence, we
split every formula into its MathML subtrees and analyzed their frequency in the corpora.
Overall, we analyzed over 2.5 billion subexpressions in 300 million documents and showed
that the frequency distribution of mathematical subexpressions is similar to that of words in
natural language corpora. By applying known frequency-based ranking functions, such as BM25, we
were also able to discover topic-relevant notations. We published these results at The Web
Conference (WWW) in 2020.
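

The counting step described above can be sketched in a few lines of Python. The sketch is only
an illustration under simplifying assumptions: the Node class, the string serialization in
render, and the toy corpus are hypothetical stand-ins for the MathML subtrees used in the study,
and ranking functions such as BM25 are not included. It enumerates all subtrees of an expression
tree, keeps those that contain at least one identifier, and counts how often each one occurs.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """One node of a simplified expression tree (hypothetical stand-in for MathML)."""
    label: str                                   # operator, function, number, or identifier
    children: List["Node"] = field(default_factory=list)
    is_identifier: bool = False                  # True for leaves such as x or n


def render(node: Node) -> str:
    """Serialize a subtree into a canonical string that serves as the counting key."""
    if not node.children:
        return node.label
    return f"{node.label}({','.join(render(c) for c in node.children)})"


def has_identifier(node: Node) -> bool:
    """A subtree qualifies if it contains at least one identifier."""
    return node.is_identifier or any(has_identifier(c) for c in node.children)


def subexpressions(node: Node) -> Iterator[Node]:
    """Yield every subtree, including the root and the leaves."""
    yield node
    for child in node.children:
        yield from subexpressions(child)


def count_subexpressions(formulae: List[Node]) -> Counter:
    """Count how often each qualifying subexpression occurs in a corpus."""
    counts: Counter = Counter()
    for formula in formulae:
        for sub in subexpressions(formula):
            if has_identifier(sub):
                counts[render(sub)] += 1
    return counts


def ident(name: str) -> Node:
    """Convenience constructor for identifier leaves."""
    return Node(name, is_identifier=True)


# Toy corpus with two formulae: x^2 + 1 and x^2.
x_squared = Node("pow", [ident("x"), Node("2")])
corpus = [Node("plus", [x_squared, Node("1")]), x_squared]
counts = count_subexpressions(corpus)
```

In this toy corpus, pow(x,2) is counted twice because it appears once as a full formula and once
as a subtree of the sum, while pure number leaves are skipped.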


“Discovering Mathematical Objects of Interest — A Study of Mathematical
Notations” by André Greiner-Petter, Moritz Schubotz, Fabian Müller, Corinna
Breitinger, Howard S. Cohl, Akiko Aizawa, and Bela Gipp. In: Proceedings of
the Web Conference (WWW), 2020.


The applications that we derived from simply counting mathematical notations were
surprisingly versatile. For example, with the large set of indexed math notations, we implemented the
first type assistant system for math equations, developed a new faceted search engine for
zbMATH, and enabled new approaches to measure potential plagiarism in equations. Besides these
practical applications, the study also gave us the confidence to continue focusing on subexpressions
for our LaTeX semantification. Previous projects that aimed to semantically enrich mathematical
expressions with information from the surrounding context primarily focused on one of the
earlier mentioned extremes, i.e., the leaves or roots in expression trees [139, 214, 279, 329, 330].
Our study also revealed that the majority of unique mathematical formulae are neither single
identifiers nor highly complex mathematical expressions. Hence, we concluded that we should
focus on semantically enriching subexpressions (subtrees) rather than the roots or leaves. We
proposed a novel context-sensitive translation approach based on semantically annotated MOI
and shared our theoretical concept with the community at the International Conference on
Mathematical Software (ICMS) in 2020.


“Making Presentation Math Computable: Proposing a Context Sensitive
Approach for Translating LaTeX to Computer Algebra Systems” by André
Greiner-Petter, Moritz Schubotz, Akiko Aizawa, and Bela Gipp. In:
Proceedings of the International Conference on Mathematical Software (ICMS),
2020.


Afterward, we started to realize the proposed pipeline with a specific focus on Wikipedia. We
focused on this encyclopedia for two reasons. First, Wikipedia is a free and community-driven
encyclopedia and, therefore, (a) less strict on writing styles and (b) more descriptive compared to
scientific articles. Second, Wikipedia can actively benefit from our contribution since additional
semantic information about mathematical formulae can support users of all experience levels
to read and comprehend articles more efficiently [150]. Moreover, a successful translation from
a formula in Wikipedia to a CAS makes the formula computable, which enables numerous
additional applications. In theory, a mathematical article could be turned into an interactive
document to some degree with our translations. However, the most valuable application of a
translation of formulae in Wikipedia would be the ability to check equations for their
plausibility. With the help of CAS, we are able to analyze if an equation is semantically correct or
suspicious. This evaluation would enable existing quality measures in Wikipedia to incorporate
mathematical equations for the first time. The results from our novel context-sensitive
translator, including the plausibility check algorithms, have been accepted for publication in the IEEE
Transactions on Pattern Analysis and Machine Intelligence (TPAMI) journal and are currently
in press.


“Do the Math: Making Mathematics in Wikipedia Computable.” André
Greiner-Petter, Moritz Schubotz, Corinna Breitinger, Philipp Scharpf, Akiko
Aizawa, and Bela Gipp. In press: IEEE Transactions on Pattern Analysis and
Machine Intelligence (TPAMI), 2021.


Currently, we are also actively working on extending the backbone of Wikipedia itself to
present additional semantic information about mathematical expressions when hovering over
or clicking on a formula. This new feature helps Wikipedia users to better understand the
meaning of mathematical formulae by providing details on the elements of formulae. Moreover,
it paves the way towards an interface to actively interact with mathematical content in Wikipedia
articles. We presented our progress and discussed our plans in the poster session at the JCDL
in 2020.


“Mathematical Formulae in Wikimedia Projects 2020.” Moritz Schubotz, André
Greiner-Petter, Norman Meuschke, Olaf Teschke, and Bela Gipp. In: Poster
Session at the ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2020.


Evaluating Digital Mathematical Libraries Alongside this main research path, we
continuously improved and extended LaCASt with new features and new supported CAS. Our first goal
was to verify the translated, now computable, formulae in the DLMF. The primary motivation
behind this approach was to quantitatively measure the accuracy of LaCASt translations. How
can we verify if a translation was correct? The well-established Bilingual Evaluation Understudy
(BLEU) [282] measure in natural language translations is not directly applicable for
mathematical languages because an expression may contain entirely different tokens but still be equivalent
to the gold standard. Since the translation is computable, however, we can take advantage of
the power of CAS to verify a translation. The basic idea is that a human-verified equation in
one system must remain valid in the target system. If this is not the case, only three sources
of errors are possible: either the source equation, the translation, or the CAS verification was
incorrect. With the assumption that equations in the DLMF and major proprietary CAS are
mostly error-free, we can translate equations from the DLMF to discover issues within LaCASt.
First, we focused on symbolic verifications, i.e., we used the CAS to symbolically simplify the
difference between the left- and right-hand sides of an equation. If the simplified difference is 0,
the CAS symbolically verified the equivalence of the left- and right-hand sides and confirmed a
correct translation via LaCASt. Additionally, we extended the verification approach to include
more precise numeric evaluations. If a symbolic manipulation failed to return 0, it could also
mean the CAS was unable to simplify the expression. To overcome this issue, we numerically
calculate the difference on specific test values and check if the difference is below a given
threshold. If all test calculations are below the threshold, we consider the equation numerically
verified. Even though this approach cannot verify equivalence, it is very effective in discovering
disparity. We published the first paper with this new verification approach based on Maple at
the CICM in 2018.
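

The numeric part of this verification idea can be illustrated with a minimal Python sketch. It
does not reproduce the LaCASt pipeline: the function name numerically_verified, the test values,
and the threshold are assumptions chosen for illustration, and Python's math module stands in
for the CAS. The sketch evaluates the difference between the left- and right-hand sides of an
equation on a few test values and checks it against the threshold.

```python
import math
from typing import Callable, Iterable


def numerically_verified(
    lhs: Callable[[float], float],
    rhs: Callable[[float], float],
    test_values: Iterable[float],
    threshold: float = 1e-9,        # hypothetical absolute tolerance
) -> bool:
    """Return True if |lhs(x) - rhs(x)| stays below the threshold for all test values."""
    return all(abs(lhs(x) - rhs(x)) < threshold for x in test_values)


test_values = [0.5, 1.5, 2.0, 3.25]  # hypothetical test values

# Gamma(x + 1) = x * Gamma(x): the difference vanishes (up to rounding) on all test values.
ok = numerically_verified(
    lambda x: math.gamma(x + 1),
    lambda x: x * math.gamma(x),
    test_values,
)

# Gamma(x + 1) = (x + 1) * Gamma(x): a deliberately wrong equation that the test rejects.
bad = numerically_verified(
    lambda x: math.gamma(x + 1),
    lambda x: (x + 1) * math.gamma(x),
    test_values,
)
```

As noted above, passing such a test does not prove equivalence; a single failing test value,
however, is enough to flag the equation or its translation as suspicious.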


“Automated Symbolic and Numerical Testing of DLMF Formulae Using
Computer Algebra Systems” by Howard S. Cohl, André Greiner-Petter, and
Moritz Schubotz. In: Proceedings of the International Conference on Intelligent
Computer Mathematics (CICM), 2018.


The extension of the system and the new results led us to an extended journal version of the
initial LaCASt publication [3]. This extended version mostly covered parts of my Master's thesis
and is not reused in this thesis. For technical details about LaCASt, see the journal publication [13].
In Appendix D, available in the electronic supplementary material, we summarized all significant
issues and reported bugs we discovered via LaCASt. The section also includes new issues that we
discovered during the work on the journal publication. This journal version was published in
the Aslib Journal of Information Management in 2019.


“Semantic preserving bijective mappings for expressions involving special func- 
tions between computer algebra systems and document preparation systems” 
by André Greiner-Petter, Howard S. Cohl, Moritz Schubotz, and Bela Gipp. 
In: Aslib Journal of Information Management 71(3): 415-439, 2019. 


Appendix D — [13] 


It turned out that LaCASt translations on semantic LaTeX were so stable that we can use the
same approach for verifying translations also to specifically search for errors in the DLMF
and issues in CAS. To maximize the number of supported DLMF formulae, we added
additional heuristics to LaCASt, such as a logic to identify the end of a sum or to correctly
interpret prime notations as derivatives. Additionally, we added support for translations to
Mathematica and SymPy. We extended the support for Mathematica even further to perform
the same verifications in Mathematica that we perform in Maple. The Mathematica support
finally allows us to identify computational differences between two major proprietary CAS.
Moreover, we extended the previously introduced symbolic and numeric evaluation pipeline with
more sophisticated variable extraction algorithms, more comprehensive numeric test values,
resolved substitutions, and improved constraint-awareness. All discovered issues are summarized
in Appendix D, available in the electronic supplementary material. We further made all
translations of the DLMF formulae publicly available, including the symbolic and numeric
verification results. The results of this recent study have been published at the international
conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS).


“Comparative Verification of the Digital Library of Mathematical Functions
and Computer Algebra Systems” by André Greiner-Petter, Howard S. Cohl,
Abdou Youssef, Moritz Schubotz, Avi Trost, Rajen Dey, Akiko Aizawa, and
Bela Gipp. In: Tools and Algorithms for the Construction and Analysis of
Systems (TACAS), 2022.


We also applied the same verification technique to the Wikipedia articles we mentioned
earlier, which enabled LaCASt to symbolically and numerically verify even complex equations in
Wikipedia articles. This evaluation is also part of the TPAMI submission.




Preprints of my publications are available at
https://pub.agp-research.com


My Google Scholar profile is available at
https://scholar.google.com/citations?user=Mq2B90gAAAAJ


All translations of the DLMF formulae are available at
https://lacast.wmflabs.org


A prototype of LaCASt for Wikipedia is available at
https://tpami.wmflabs.org


This Chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License 


(http://creativecommons.org/licenses/by/4.0/).


Making presentational math computable implies a transformation from one mathematical
representation to another. In order to frame this task, we need to introduce presentational and
computable formats and analyze available transformation tools between these formats. There
is a large variety of different formats available to encode mathematical expressions, from visual
formats, such as LaTeX [220] or MathML [60], to semantically enhanced encodings, such as content
MathML [270], semantic LaTeX [260], sTeX [200], or OpenMath [19], and entire programming
languages, such as CAS syntaxes [36, 128, 173, 175, 176, 177, 178, 393], theorem provers [37,
266, 287, 340, 354, 384], or mathematical packages in C++ [168], Python [252], or Java [79]. This
chapter introduces what we understand as presentational and computable formats, provides an
overview of math formats, and discusses existing transformation tools between these formats.


In particular, Section 2.1 introduces presentational and computable formats. Section 2.2 provides
an extensive overview of mathematical formats, their attributes, and conversion approaches
between them. Since there is a large variety of conversion tools and approaches available
for many different formats [39, 200, 18, 351, 406], a translation from a presentational to a
computable format can be achieved in many different ways. In this thesis, we mainly focus on
translations from LaTeX to CAS syntaxes. The most well-studied translation path from LaTeX to
CAS syntaxes would use content MathML as an intermediate, semantically enriched format.
Hence, Section 2.3 analyzes state-of-the-art LaTeX to MathML converters. Section 2.4 underlines
the research gap and paves the way for the rest of the thesis by briefly discussing MathIR
approaches for conversions from presentational to computable formats. Section 2.3 has been
published at the JCDL [18]. The introduction of math embeddings in Section 2.2 was published
as a workshop paper at the SIGIR conference [9] and later reused in an extended article for the
Scientometrics journal [15].


2.1 Background and Overview 


Computable encodings are interpretable formal languages in which keywords or sequences of
tokens are associated with specific implemented definitions, which allows performing certain
mathematical actions on these elements, such as evaluating numeric values or symbolically
manipulating the elements. Computable encodings, therefore, must be semantically
unambiguous. Otherwise, an interpreter is unable to associate the sequence of tokens with a unique
underlying definition. This ambiguity problem is mainly solved by interpreters in two ways:
either the system automatically performs disambiguation steps following a decision tree with
a fixed set of internal rules, such as for x^y^z in Mathematica, or the system refuses to parse the
expression and returns an error, such as for x^y^z in Maple.


Computable formats are formal languages that link key words or phrases with 
unique implemented definitions. Computable expressions are semantically unam- 
biguous. 


Presentational formats, on the other hand, focus on controlling the visualization of mathematical
formulae. They generally allow users to change spaces between tokens (e.g., \, and \; in LaTeX),
support two-dimensional visualizations (e.g., fractions or integrals with limits), or render entire
graphs and images. However, pure presentational formats (in contrast to enhanced semantic
encodings) do not specify the meaning of an expression. Consequently, mathematical expressions
in presentational formats are generally semantically ambiguous, and it is the author's
responsibility to disambiguate the meaning of the expression by providing additional information
in the context. Digital presentational formats, such as LaTeX, are also interpretable formal
languages. In contrast to computable formats, presentational languages link tokens with specific
visualizations rather than executable subroutines. Hence, expressions in these formats must be
unambiguous too. Otherwise, interpreters are unable to link an expression with a unique
visualization (see x^y^z in LaTeX). The difference to computable encodings is that expressions in
presentational formats must be visually but not semantically unambiguous. For instance, LaTeX
refuses to parse x^y^z because the rendering of {x^y}^z (see ${x^y}^z$) and x^{y^z} (see
$x^{y^z}$) is different. In contrast, Maple rejects x^y^z because there is a mathematical (and in
consequence a computational) difference between $(x^y)^z$ and $x^{(y^z)}$.


Presentational formats are formal languages with a focus on visualization.
Presentational expressions can be semantically but not visually ambiguous.
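

The computational difference between the two readings of x^y^z is easy to see on concrete
numbers. The following Python sketch is purely illustrative (Python is not one of the systems
discussed here, and the chosen values are arbitrary). Its ** operator is right-associative, so
Python resolves the ungrouped form by a fixed internal rule instead of rejecting it.

```python
x, y, z = 2, 3, 2

left_grouped = (x ** y) ** z     # (x^y)^z = 8^2 = 64
right_grouped = x ** (y ** z)    # x^(y^z) = 2^9 = 512
implicit = x ** y ** z           # ungrouped form: Python picks the right-associative reading

readings_differ = left_grouped != right_grouped
```

An interpreter that rejects the ungrouped form forces the author to choose one of the two
groupings explicitly, whereas an interpreter with a built-in rule chooses silently.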


In this thesis, we focus on LaTeX for the presentational format and CAS syntaxes for computable
formats. We choose LaTeX because it is currently the de-facto standard for writing scientific
papers in the STEM disciplines [129, 402]. Several other word processors, such as the article
editor in Wikipedia or Microsoft's Word [248], entirely or partially support LaTeX inputs. In
addition, LaTeX is the main presentational format that is entered by hand. In contrast, MathML,
due to its XML data structure, is not a user-friendly encoding and is mostly automatically
generated from other formats [82, 159, 18, 374]. Image formats are the result of pictures, scans,
or handwritten inputs and, therefore, less machine-readable. As a consequence, image
formats of mathematical formulae are mainly converted into LaTeX or MathML in a pre-processing
step [27, 39, 267, 378, 379, 406, 411]. We choose CAS syntaxes for our target computable format
because CAS generally support a large variety of different use cases, from manipulations and
visualizations to computations and simulations [81, 413]. Especially general-purpose CAS,
such as Maple [36] and Mathematica [393], address a broad range of topics [128, 392]. In
contrast, theorem provers, proof assistants, and similar software, as potential other computable
formats, solely focus on automated reasoning [147, 266, 354, 384]. Hence, the computation of
mathematical formulae plays a less significant role in such software.


2.2 Mathematical Formats and Their Conversions 


Figure 2.1 provides an overview of different math encodings and existing conversion approaches 
between them. In addition to the figure, Table 2.1 provides quick access to references for specific 
translation directions. Figure 2.1 organizes formats by their level of semantics and the level 
of machine readability. This categorization is not meant to be as accurate as possible nor to 
be complete. Instead, the figure aims to provide a rough visualization of the most common 
encodings and their differences. For instance, there is no notable technical difference between 


¹Note that this interpretation of presentational formats does not include images. Since images are less machine-
readable formats, they are generally first converted into interpretable formats, such as LaTeX. This conversion process
is very challenging on its own [406, 411]. Hence, including images in our task would not provide any benefit but
would make it unnecessarily more complicated.

²https://en.wikipedia.org/wiki/Help:Displaying_a_formula [accessed 2021-10-01]

³A little histrionically described as ‘Making humans edit XML is sadistic!’ in the Django 1.7.11 documenta-
tion [118].


Figure 2.1: Reference map of mathematical formats and translations between them. The red
path illustrates the main subject of this thesis. In Section 2.3, we focus specifically on existing
translation approaches from LaTeX to MathML (orange arrows) to evaluate an alternative to the
red translation path.


the levels of semantics in content MathML and OpenMath (see the paragraph about OpenMath 
in Section 2.2.1). Nonetheless, OpenMath defines the content dictionaries that content MathML 
uses to semantically annotate symbols beyond school mathematics. Hence, we could argue 
that content MathML encodes less semantic information without the help of OpenMath and, 
therefore, should be positioned more to the left. Another disparity can be found in the level 
of machine readability between CAS syntaxes and theorem provers. Since both formats are 
programming languages, any CAS or theorem prover expression requires a very specific (often 
proprietary) parser. Thus, a programming language is arguably never more machine readable 
than any other programming language. Nonetheless, most CAS prefer a more intuitive input 
format (sometimes even 2D inputs) similar to LaTeX over a machine-readable syntax [88, 128,
179] to improve their user experience. Because of these more user-friendly input formats, we
positioned CAS syntaxes below theorem prover formats. Note also that math embeddings, i.e.,
vector representations of math tokens, are not in Figure 2.1 because the level of semantics these
vectors capture is still unclear and an open research question (see Section 2.2.5). The red path
in Figure 2.1 shows the new translation path that we focus on in this thesis. Dotted arrows
represent translation paths that generally do not require context analysis and are, therefore,
of less interest for the subject of this thesis. The orange and red arrows (and highlighted cells
in Table 2.1) refer to our contributions in this thesis. The red arrows refer to the main research
contribution explained in Chapters 3 and 4.


2.2.1 Web Formats 


Web formats are designed to display mathematical formulae and knowledge on the web. Con- 
sequently, those formats prioritize machine readability over user experience. Hence, a variety 
of different translation approaches to, from, or between web formats exists. Since mathematics
on the web is generally embedded in HTML code, most web formats use the XML encoding

Table 2.1: Overview of available mathematical format translations. The highlighted
conversion fields refer to contributions made in this thesis. The columns and rows refer to:
‘pMML’ for presentation MathML, ‘cMML’ for content MathML, ‘sem.LaTeX’ for semantic
LaTeX, ‘Theo. Prov.’ for theorem provers or proof assistants, ‘Img’ for images, and ‘Speech’ for
spoken (audio) mathematical content. The group ‘Comp.’ refers to computable formats. In some
cases, no transformation is necessary, e.g., from OMDoc to OpenMath because OMDoc uses
OpenMath internally. In this (and similar) cases, we simply refer to the overview publication of
the format, here [198] for OMDoc.


E "5 | 8 
eZ 226 8 dais ai fig: 
From a 2) E 3 | = 3 8 l E Q ' S ' E ' 5 
pMML| / [364] | [61] | [391] [86] : [358] [349] 
cMML | [300] / [342 198] | ' [891] [318]: [242] [257]! 2 
OpenMath | [59] [342] / 198] | [61] ' [57] [303] > 
OMDoc | [198] [198] [198] / ' [198 [198] [152] [152]! 
ee LaTeX! [18] [150] [257] [198] / [11] [195]; Mal: [15] i [358]: [249] 
sem.LaTeX | [257] [18] [257 [257] / ' [13] ' [404] ' [257] %5 
sX | [257] [257] [257] [195] | [198] M i ' [257] 
Theo, Prov | 205] [205] 67 — a © io e 
CAS | [391] [391] ' [391] [13] [338] B61]; : [391]: S 
o Veetor = kt sso st 
Ka el ae 
n Speech | [386] Bel Be 
Web ' TeX ' Comp. ! ' ' 


structure. Thus, web formats are often described as verbose and rarely edited or created by 
hand. On the other hand, the XML structure simplifies the inter-connectivity between web 
formats, e.g., via XSL Transformations (XSLT) [362]. There are three main formats used on
the web: the current web standard MathML, the pure semantic encoding OpenMath, and the
semantic document encoding OMDoc. Note that many websites still use image formats to 
display math. We will discuss image formats in Section 2.2.4. 


2.2.1.1 MathML 


For the web, the Mathematical Markup Language (MathML) [60] is the current official recom- 
mendation from the World Wide Web Consortium (W3C) and even an official standard since 
2015 [169] for HTML5. MathML is defined via two different markups: the presentation⁴ and


⁴https://www.w3.org/TR/MathML3/chapter3.html [accessed 2021-10-01]
the content⁵ markup. MathML containing only presentation markup elements is, therefore,
also called presentation MathML or, in case it only contains content markup elements, content 
MathML, respectively. Both markups can be used together side by side for a single expression in 
so-called parallel markup [202, 259, 270]. If elements in the presentation markup are linked back 
and forth with elements from the content markup, the encoding is also called cross-referenced 
MathML. 


Content MathML, in contrast to presentation MathML, aims to encode the meaning, i.e., the 
semantics, of mathematical expressions. Content MathML addresses the issues of ambiguous 
presentational encodings by providing a standard representation of the content of mathematics. 
The encoding comes with a large number of predefined functions, e.g., for sin and log, intend-
ing to cover most of K-14⁶ mathematics. For formulae beyond school mathematics, content
MathML uses so-called Content Dictionaries (CDs) [204] (see the OpenMath paragraph for more
details about CDs). Listing 2.1 shows presentation and content MathML encodings for the Leg-
endre polynomial $P_n(x)$. Note that the presentation MathML encoding contains an operator
(<mo> for mathematical operator) between $P_n$ and $(x)$ which contains the invisible character
function application (Unicode character U+2061). Nowadays, content MathML is often used in
digital libraries to improve the performance of math search engines with accessible semantic 
information [345, 347, 348, 381]. 


Since MathML is the web standard, there are numerous tools available that convert other 
encodings from and to MathML. Most common conversions include translations between 
presentation and content MathML [139, 270, 364], from [159, 257, 267, 335, 374] and to⁷ LaTeX,
OpenMath [59, 342, 343], CAS [318], PDF [27, 267], images [406], and audio encodings (mainly
in the math to speech research field) [67, 349, 387]. The W3C officially lists 42 converters and
other software tools that generate MathML on their wiki⁸. In addition, the official interoperability
report⁹ of MathML provides a comprehensive overview of software that supports MathML and
shows official statements from implementors. Due to its XML format, most conversion tools use
XSLT [362] to transform MathML into either other XML encodings or string representations [59, 
61]. This translation approach can be described as rule-based, because in XSLT, we define a set 
of transformation rules for XML subtrees. 
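
The rule-based character of such conversions can also be sketched outside of XSLT. The following Python snippet is a minimal illustration and not one of the cited converters: it walks the content MathML tree of Listing 2.1 and applies one hand-written rule per symbol. The rule table `SYMBOL_RULES` and the function name `translate` are assumptions made for this example; the target string follows Mathematica's `LegendreP[n, x]` notation.

```python
import xml.etree.ElementTree as ET

# Content MathML for the Legendre polynomial, as in Listing 2.1.
CONTENT_MATHML = """
<apply>
  <csymbol definitionURL="http://www.openmath.org/cd/orthpoly1.ocd"
           encoding="OpenMath">legendreP</csymbol>
  <ci>n</ci>
  <ci>x</ci>
</apply>
"""

# Hypothetical rule table: one target template per known symbol.
SYMBOL_RULES = {"legendreP": "LegendreP[{0}, {1}]"}


def translate(node: ET.Element) -> str:
    """Translate a content MathML subtree into a CAS-like string."""
    if node.tag in ("ci", "cn"):  # identifiers and numbers are copied
        return (node.text or "").strip()
    if node.tag == "apply":  # first child is the operator, the rest are arguments
        head, *args = list(node)
        name = (head.text or "").strip()
        if name not in SYMBOL_RULES:
            raise ValueError(f"no translation rule for symbol {name!r}")
        return SYMBOL_RULES[name].format(*(translate(arg) for arg in args))
    raise ValueError(f"unsupported element <{node.tag}>")


print(translate(ET.fromstring(CONTENT_MATHML)))
```

Each entry in the rule table plays the role of one XSLT template: the translation only succeeds for symbols that have an explicit rule, which is the typical limitation of rule-based converters.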


Most of the converters to MathML do not support content MathML. Translations from presen- 
tational formats to content MathML face a wide range of ambiguity issues [159, 259, 374]. For
example, the <mo> element in Listing 2.1 regularly contains the invisible times symbol (Unicode
character U+2062) rather than function application because most conversion tools do not interpret
$P_n$ as a function. For content MathML, even more disambiguation steps are required to
link P with the Legendre polynomial correctly. For such disambiguation, a combination of
semantification and XSLT rules is used to perform translations to content MathML [139, 270,
364]. Nghiem et al. [270] propose a machine translation approach to generate content MathML
from presentation MathML but do not consider textual descriptions from the surrounding
context of a formula. Likewise, Toloaca and Kohlhase [364] use patterns of notation definitions

⁵https://www.w3.org/TR/MathML3/chapter4.html [accessed 2021-10-01]
⁶Kindergarten to early college.
⁷Two well-known projects for translations from MathML to LaTeX use XSL transformations:
web-xslt https://github.com/davidcarlisle/web-xslt/tree/main/pmml2tex and
mml2tex https://github.com/transpect/mml2tex [accessed 2021-10-01].
⁸https://www.w3.org/wiki/Math_Tools [accessed 2021-10-01]
⁹https://www.w3.org/Math/iandi/mml3-impl-interop20090520.html [accessed 2021-10-01]
to find a content MathML expression that matches the presentation MathML parse tree. Grigore
et al. [139], on the other hand, generate a local context of the five nouns preceding the expression
to infer symbol declarations from OpenMath CDs. Besides Grigore et al. [139], other
existing approaches for translations to content MathML only consider the semantics within
the given formula itself or in formulae in the same document [159, 259, 374] but ignore the
textual context surrounding a formula. For example, these tools follow the assumption that a P
with subscript followed by an expression in parentheses should be interpreted as the Legendre
polynomial. However, many expressions cannot be disambiguated without considering the
textual context, such as the $\pi(x + y)$ example from the introduction.
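
The difference between the two invisible operators can be made concrete with a short sketch. The Python snippet below is only an illustration under our own assumptions (the helper names `legendre_candidate` and `reading_of` are hypothetical): it builds the presentation MathML of Listing 2.1 once with function application (U+2061) and once with invisible times (U+2062) and reports which reading a consumer of the markup would pick up.

```python
import xml.etree.ElementTree as ET

FUNCTION_APPLICATION = "\u2061"  # invisible function application
INVISIBLE_TIMES = "\u2062"       # invisible multiplication


def legendre_candidate(invisible_operator: str) -> str:
    """Build presentation MathML for P_n followed by (x) with the given operator."""
    return (
        "<mrow><msub><mi>P</mi><mi>n</mi></msub>"
        f"<mo>{invisible_operator}</mo>"
        "<mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow>"
    )


def reading_of(pmml: str) -> str:
    """Return the interpretation encoded by the first invisible operator."""
    for mo in ET.fromstring(pmml).iter("mo"):
        if mo.text == FUNCTION_APPLICATION:
            return "function application"
        if mo.text == INVISIBLE_TIMES:
            return "multiplication"
    return "unknown"


print(reading_of(legendre_candidate(FUNCTION_APPLICATION)))
print(reading_of(legendre_candidate(INVISIBLE_TIMES)))
```

Both expressions render identically, so the choice between the two characters is exactly the disambiguation decision that a converter has to make without access to the surrounding text.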


Most CAS support MathML either directly or via external software packages [318, 343]. How-
ever, to the best of our knowledge, no CAS currently considers the CDs in content MathML
correctly. Hence, these import and export functions in CAS are generally limited to school
mathematics. It should be noted that the CDs are considered by CAS but only in OpenMath, e.g.,
via the transport protocol Symbolic Computation Software Composability Protocol (SCSCP) [361].
Since this protocol was developed to enable inter-CAS communication, we explain this project
in more detail in Section 2.2.3.


In summary, a reliable generation of content MathML requires a semantically enhanced source
formula, e.g., in CAS syntaxes [318, 343], theorem prover formats [152], or OpenMath [59,
342]. Otherwise, translations tend to generate inaccurate MathML. In Section 2.3, we will
examine existing LaTeX to MathML converters in more detail to investigate the practicality of
using MathML as an intermediate format for translations from LaTeX to CAS encodings.


2.2.1.2 OpenMath 


The OpenMath Society (originally OpenMath Consortium [19]) defines another standard encod- 
ing called OpenMath [53]. The OpenMath standard aims to focus exclusively on the semantics
of mathematics and, therefore, goes a step further than MathML [204], which aims
to cover both the presentation and the content information in a single format. Originally,
OpenMath was invented during a series of workshops starting in 1993, mainly by researchers
in the computer algebra community, to easily exchange mathematical expressions between
CAS and other systems [19, 89]. MathML, originally developed with the same goal, was first
released in 1998¹⁰. Both formats are very similar to each other [204] and one may ask for the
purpose of two different formats for more or less the same tasks [82, 114]. Discussions about
the necessity of both formats arise from time to time even decades later [25, 204]. However,
OpenMath and MathML have been and are still developed alongside each other rather than
competing with one another due to a large overlap of people working on both formats [204].
To summarize the coexistence today: MathML provides rendered visualizations for OpenMath,
while the Content Dictionaries (CDs) from OpenMath add semantics to MathML¹¹.


The OpenMath Society maintains a set of standard CDs. A CD is a set of declarations (i.e., 
definitions, notations, constraints, etc.) for mathematical symbols, functions, operators, and 
other mathematical concepts. The idea behind the publicly maintained CDs by the OpenMath 


¹⁰https://www.w3.org/TR/1998/REC-xml-19980210 [accessed 2021-10-01]

¹¹A more detailed discussion about the history of both formats can be found at https://openmath.org/
projects/esprit/final/node6.htm, https://openmath.org/om-mml/ [both accessed 2021-10-01],
and [198, pp. 5].

Presentation MathML

    <mrow>
      <msub>
        <mi>P</mi>
        <mi>n</mi>
      </msub>
      <mo>
        <!-- Invisible Funct. Appl. Unicode U+2061 -->
      </mo>
      <mrow>
        <mo>(</mo>
        <mi>x</mi>
        <mo>)</mo>
      </mrow>
    </mrow>

Content MathML

    <apply>
      <csymbol definitionURL="http://www.openmath.org/cd/orthpoly1.ocd"
               encoding="OpenMath">legendreP
      </csymbol>
      <ci>n</ci><ci>x</ci>
    </apply>

OpenMath

    <OMOBJ><OMA>
      <OMS name="legendreP" cd="orthpoly1"/>
      <OMV name="n"/>
      <OMV name="x"/>
    </OMA></OMOBJ>


Listing 2.1: The Legendre polynomial in two MathML encodings and in OpenMath. 


Society is to provide a ground truth for math declarations so that the used symbols become in- 
terchangeable among different parties. However, everybody can create new custom CDs which 
might be integrated into the existing standard set maintained by the OpenMath Society [90]. 
M. Schubotz [327], for example, proposed a concept for a CD that builds on the knowledge
base Wikidata. More recently, B. Miller [258] created a content dictionary specifically for the
functions in the DLMF. 


Listing 2.1 compares both MathML markups with OpenMath. While the tree structures of 
content MathML and OpenMath cannot directly be compared with mathematical expression 
trees [331] (see also Section 2.2.4), the XML tree structure of both formats is unique. Both 
formats rely on the CD entry of the Legendre polynomial in orthpoly1¹². Since the CD is from
OpenMath, the OpenMath encoding does not require the entire URL. The CD entry further
specifies that the Legendre polynomial has two arguments. Hence, the following two siblings in
the tree structure are considered to be the arguments. OpenMath specifically annotates them as
OMV (for variable objects). As an alternative to the orthpoly1 CD by OpenMath, one can also use
Schubotz’s [327] Wikidata CD to annotate P with the Wikidata item Q215405 or Miller’s [258]
DLMF CD to link P to §18.3 of the DLMF [98, (18.3)].


As previously mentioned, both formats (content MathML and OpenMath) are rather similar 
to each other [56, 343]. Hence, there are several ways to transform mathematical expressions 
between both formats [343], e.g., via XSLT [59, 342]. This transformation is possible without 
information retrieval techniques since both formats encode the same level of semantic infor- 
mation via CDs. Even though the primary goal for OpenMath was to provide a format that 
allows communication between mathematical software [19], most CAS do not support Open- 
Math directly. Instead, an independent project of research institutions funded by the European 
Union was launched to improve the symbolic computation infrastructure in Europe. The main 


¹²https://openmath.org/cd/orthpoly1.html#legendreP [accessed 2021-10-01]
result of this project was the SCSCP protocol for inter-CAS communication via OpenMath. We
will discuss the SCSCP protocol and the project in more detail in Section 2.2.3. Several CAS,
including Maple [243] and Mathematica [44], implemented endpoints for the SCSCP protocol. 
Hence, via this new protocol, CAS support OpenMath to some degree. Apart from the protocol 
solution, there are some research projects available that use OpenMath as an interface to and 
between CAS and theorem prover formats [57, 152, 303, 338, 343]. 
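
Because both formats encode the same information via CDs, a conversion from OpenMath to content MathML needs no information retrieval and reduces to a structural mapping. The following Python sketch illustrates this for the elements used in Listing 2.1 only; it is a simplified stand-in for the XSLT-based converters cited above, and the function name `om_to_cmml` as well as the constant `CD_BASE` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# OpenMath encoding of the Legendre polynomial, as in Listing 2.1.
OPENMATH = """<OMOBJ><OMA>
  <OMS name="legendreP" cd="orthpoly1"/>
  <OMV name="n"/>
  <OMV name="x"/>
</OMA></OMOBJ>"""

# Base URL of the OpenMath CDs as it appears in Listing 2.1.
CD_BASE = "http://www.openmath.org/cd/"


def om_to_cmml(node: ET.Element) -> ET.Element:
    """Map an OpenMath element to its content MathML counterpart."""
    if node.tag == "OMOBJ":  # wrapper object: convert its single child
        return om_to_cmml(node[0])
    if node.tag == "OMA":  # application -> <apply>
        apply = ET.Element("apply")
        apply.extend([om_to_cmml(child) for child in node])
        return apply
    if node.tag == "OMS":  # symbol -> <csymbol> pointing to the CD
        symbol = ET.Element(
            "csymbol",
            {"definitionURL": f"{CD_BASE}{node.get('cd')}.ocd", "encoding": "OpenMath"},
        )
        symbol.text = node.get("name")
        return symbol
    if node.tag == "OMV":  # variable -> <ci>
        identifier = ET.Element("ci")
        identifier.text = node.get("name")
        return identifier
    raise ValueError(f"unsupported OpenMath element <{node.tag}>")


cmml = om_to_cmml(ET.fromstring(OPENMATH))
print(ET.tostring(cmml, encoding="unicode"))
```

The only place where the formats differ here is the reference to the CD: OpenMath stores the short CD name, whereas content MathML carries the full URL.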


2.2.1.3 OMDoc 


Sometimes, it might be worthwhile to annotate the context of mathematical expressions with 
additional information explicitly. For example, an equation might be part of a theorem that has 
not been proven yet. Hence, that particular equation and its context should not be confused 
with a definition. Since this meta-information about mathematical expressions is organized 
on a document level, Kohlhase [198, 199] introduced another format, the Open Mathematical 
Document (OMDoc), to semantically describe entire mathematical documents. While formats 
like OpenMath or MathML encode the semantics of single expressions, which Kohlhase describes 
as the microscopic level, OMDoc aims for the macroscopic, i.e., the document level. This format 
can be especially useful for interactive documents [80, 85, 131, 150, 162, 201] and theorem
provers [38, 146, 163, 340], which generally rely more on the meta-information from the document
level. Single math expressions in OMDoc are still encoded as OpenMath for the semantics and
MathML for the visualization. This thesis, in contrast, focuses on the formats that directly
encode mathematical expressions rather than on a macroscopic-level encoding. Nonetheless, it
should be noted that a translation to a CAS might differ depending on the scope of
an equation, e.g., an equation symbol in a definition differs from an equation symbol in an
example. Heras et al. [152], for example, used OMDoc to interface CAS and theorem provers.
Hence, the OMDoc format might be worth supporting once the translation reaches a level of
reliability and comprehensiveness at which the semantics on the document level matter (see the
future work in Section 6.3).


2.2.2 Word Processor Formats 


The previously explained formats of mathematics are beneficial for web applications and ex- 
changing mathematical knowledge between systems. However, the underlying verbose XML 
data structure makes manual maintenance of these formats too cumbersome. As a result, MathML
and OpenMath expressions beyond a certain size are almost always computer-generated. The actual
source of the data, something a human manually typed, uses a different format, such as LaTeX,
visual template editors, or image formats. In the following, we introduce formats and methods
used to type mathematics in word processors manually.


2.2.2.1 TeX


LaTeX is currently the de-facto standard for writing scientific papers in the STEM disciplines [129,
220, 402] and has even been described as the lingua franca of the scientific world [220]. Numerous
other word processors entirely or partially support LaTeX inputs. LaTeX was developed by Leslie
Lamport and extended the TeX system with some valuable macros that make working with TeX
easier [220]. TeX was developed by Donald E. Knuth [189, p. 559] in 1977. Knuth was dissatisfied
with the typography of his book, The Art of Computer Programming [189, pp. 5, 6, and 24]
and created TeX to overcome the hurdles of consistently and reliably typesetting mathematical
formulae for printing. Today, there is no significant difference between LaTeX and TeX in terms
of mathematical expressions. Hence, we continue using LaTeX as the modern successor and
refer to TeX only to underline technical differences or to describe the underlying base for other
TeX-like encodings. LaTeX provides an intuitive syntax for mathematics that is similar to the
way a person would write the math by hand, e.g., by using the underscore to set a sequence of
tokens in subscript.


LaTeX is an interpretable language that requires a parser. Theoretically, the flexibility of LaTeX (and
especially the underlying TeX implementation) makes parsing LaTeX really challenging [187].
For example, TeX allows every literal to be redefined at runtime, making TeX (and therefore LaTeX)
a context-sensitive formal language. However, in practice, most LaTeX literals are gen-
erally not redefined. Instead, it is common to extend LaTeX with additional commands rather
than redefining existing logic. Especially in mathematical expressions, several projects simply
presume that LaTeX is parsable with a context-free grammar, which makes parsing mathematical
expressions in LaTeX a lot simpler [71, 402].
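
To illustrate what the context-free assumption buys, the following Python sketch parses a small subset of LaTeX math (single-character atoms, commands, braced groups, subscripts, and superscripts) with a plain recursive-descent parser. It is a toy example under the stated assumption that no literal is redefined and does not reflect the grammar of any of the cited projects; the names `tokenize` and `parse` and the tuple-based tree format are our own choices.

```python
import re

# A command (\alpha), an escaped symbol (\,), or any single non-space character.
TOKEN = re.compile(r"\\[A-Za-z]+|\\.|\S")


def tokenize(source):
    """Split a LaTeX math string into a flat list of tokens."""
    return TOKEN.findall(source)


def parse(tokens):
    """Parse tokens into a nested tree of atoms, groups, and scripts."""
    pos = 0

    def atom():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "{":  # a braced group contains a nested sequence
            content = sequence(closing="}")
            pos += 1  # skip the closing brace
            return ("group", content)
        return token

    def scripted():
        nonlocal pos
        node = atom()
        # Attach any number of sub- and superscripts to the preceding atom.
        while pos < len(tokens) and tokens[pos] in ("_", "^"):
            kind = "sub" if tokens[pos] == "_" else "sup"
            pos += 1
            node = (kind, node, atom())
        return node

    def sequence(closing=None):
        items = []
        while pos < len(tokens) and tokens[pos] != closing:
            items.append(scripted())
        return items

    return sequence()


tree = parse(tokenize(r"P_n^{(\alpha,\beta)}(x)"))
print(tree)
```

The parser never needs to know what `P` or `\alpha` mean: the structure is determined by the token sequence alone, which is exactly what fails once macros may be redefined at runtime.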


Since LaTeX is the standard to typeset mathematics, numerous translation tools to the
web standard MathML are available [133, 135, 159, 257, 267, 335, 374] (see also the MathML explanation
in Section 2.2.1). In Section 2.3, we will focus more closely on translations between LaTeX
and MathML. LaTeX is also a standard target encoding for Optical Character Recognition (OCR)
techniques [406, 411], which retrieve mathematical expressions from images or PDF files (see
Section 2.2.4). LaTeX focuses solely on the representation of math (similar to presentation MathML).
Additionally, recent studies try to explore the capabilities of trained vector representations of
LaTeX expressions [121, 15, 215, 360, 400, 404] to explore new similarity measures and search
engines [404], classification approaches [404], and even the automatic generation of new LaTeX
expressions [400]. Nonetheless, the effectiveness of capturing the semantic information with
these methods is controversial [9].


LaTeX to CAS converters. Most relevant for our task are existing translation approaches
directly from LaTeX to CAS syntaxes. These translators can be categorized into two groups: (1)
CAS internal import functions and (2) external programs for specific or multiple CAS. Mathe-
matica [391] and SymPy [357] are two CAS with the ability to import LaTeX expressions directly.
SymPy’s import function was ported from the external latex2sympy¹³ project. Examples of
external tools are SnuggleTeX [251] and our in-house translator LaCASt [3, 13]. SnuggleTeX
is a LaTeX to MathML converter with the experimental feature to perform translations to the
CAS Maxima [324]. LaCASt is the predecessor project of this thesis and focused on translating
semantic LaTeX from the DLMF to the CAS Maple.


All of these converters are rule-based translators, i.e., they perform translations based on hard-
coded, pre-defined conversion rules. SnuggleTeX has supported translations to Maxima since version
1.1.0 [251]. The tool allows users to manually predefine translation rules, such as interpreting $e$
as the mathematical constant, $\Gamma$ as the Gamma function, or $f$ as a general function. SnuggleTeX
is no longer actively maintained and mostly fails to translate general expressions. The developers
themselves declare the translation to Maxima as experimental and limited¹⁴. SymPy, in contrast,


¹³The project is therefore no longer actively developed but still available on GitHub:
https://github.com/augustt198/latex2sympy [accessed 2021-10-01]

¹⁴https://www2.ph.ed.ac.uk/snuggletex/documentation/semantic-enrichment.html [accessed
2021-10-01]
is actively maintained and provides a more sophisticated import function for LaTeX expressions.
SymPy’s import function parses a given LaTeX expression via ANTLR¹⁵ and traverses through
the parse tree to convert each token (and subtree) into the SymPy syntax. SymPy uses a set of
heuristics that mostly cover standard notations, including \sin. Additionally, it uses pattern
matching approaches to identify typical mathematical concepts, such as the derivative notation
in $\frac{d}{dx} \sin(x)$. Similarly, LaCASt first parses the input expression with the Part-of-Math (POM)
tagger [402] and performs translations by traversing through the parse tree. The POM tagger
tags tokens with additional information from external lexicon files. LaCASt manipulates these
lexicon files to tag tokens with their appropriate translation patterns. LaCASt takes the translation
patterns attached to a single token and fills them with the following and preceding nodes
in the parse tree to perform a translation. Within this thesis, we will extend LaCASt further with
pattern matching techniques and human-inspired heuristics to translate more general formulae,
including the derivative notation example, sums, products, and other operators. A more detailed
discussion about the first version of LaCASt is available in [13].
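
The idea of attaching translation patterns to tokens and filling them from neighboring nodes can be sketched in a few lines. The Python snippet below is a deliberately simplified illustration and not the actual LaCASt implementation: the lexicon format, the `$i` placeholder syntax, and the function names are assumptions made for this example, and only following nodes (not preceding ones) are consumed. Braced groups of the parse tree are represented as nested lists.

```python
# Hypothetical lexicon: token -> (number of arguments, translation pattern for Maple).
LEXICON = {
    r"\JacobipolyP": (4, "JacobiP($0, $1, $2, $3)"),
    r"\sin": (1, "sin($0)"),
    r"\alpha": (0, "alpha"),
    r"\beta": (0, "beta"),
}


def translate_next(nodes, pos):
    """Translate the node at index pos and return (translation, next index)."""
    node = nodes[pos]
    if isinstance(node, list):  # a braced group: translate its content
        return translate(node), pos + 1
    # Tokens without a lexicon entry are copied unchanged.
    arity, pattern = LEXICON.get(node, (0, node))
    pos += 1
    for i in range(arity):  # fill the placeholders with the following nodes
        argument, pos = translate_next(nodes, pos)
        pattern = pattern.replace(f"${i}", argument)
    return pattern, pos


def translate(nodes):
    """Translate a sequence of parse tree nodes into a single string."""
    parts, pos = [], 0
    while pos < len(nodes):
        part, pos = translate_next(nodes, pos)
        parts.append(part)
    return " ".join(parts)


# Parse tree of \JacobipolyP{n}{\alpha}{\beta}@{x} with the @ symbol already removed.
print(translate([r"\JacobipolyP", ["n"], [r"\alpha"], [r"\beta"], ["x"]]))
```

All knowledge about the target CAS lives in the lexicon, so supporting another CAS amounts to providing a second set of patterns rather than changing the traversal.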


While SymPy and SnuggleTeX are open source and allow interested readers to analyze the
internal implementation details, we can only speculate about the solutions in proprietary soft-
ware, such as Mathematica. As we saw in Table 1.2 (and will see later in Chapter 4), Mathematica seems
to follow a pattern recognition approach to link known notations, such as $P_n^{(\alpha,\beta)}(x)$, to their in-
ternal counterparts, such as JacobiP[n, \[Alpha], \[Beta], x]. Since neither Mathematica nor
any other CAS or mentioned converter analyzes the textual context of a formula, import-
ing ambiguous notations generally fails. Since the internal logic (and therefore the underlying
patterns) is hidden, it is difficult to estimate the accuracy and power of Mathematica’s LaTeX
import function. As an alternative to Mathematica itself, one can use WolframAlpha¹⁶ [309].
WolframAlpha is described as a knowledge or answer engine. Technically, WolframAlpha is
a web interface which uses Mathematica as the backbone for computations. WolframAlpha per-
forms numerous pre-processing and interpretation steps to allow users to generate scientific
information without inputting specific Mathematica syntax [64, 383].


Table 2.2 compares the converters on our introduction examples (see Table 1.2). The table
also contains the first version of LaCASt (published in 2017 [3]) for comparison. We observe that Wol-
framAlpha clearly performs best on these simple general inputs. The reason is that WolframAlpha
focuses on a broad, less scientific audience, which allows the system to make several assumptions.
On more topic-specific inputs, such as $P_n^{(\alpha,\beta)}(\cos(a\Theta))$, it fails. This is further underlined by
the fact that Mathematica itself has no trouble interpreting $P_n^{(\alpha,\beta)}(\cos(a\Theta))$. This indicates
that both systems are optimized for their expected user groups. On these simple cases, SymPy
also performs better than Mathematica. However, SymPy’s size and support of special
functions are not comparable with Mathematica’s, and SymPy therefore falls behind Mathematica on a
more scientific dataset, such as the DLMF.


A more sophisticated evaluation on 100 randomly selected DLMF formulae revealed that Math-
ematica can be considered the current state-of-the-art for translating LaTeX to CAS. Nonetheless,
it only translated 11 cases correctly compared to 7 successful translations by SymPy and 22 by
LaCASt. The full benchmark is available in Table E.1 in Appendix E.1 of the electronic
supplementary material.


¹⁵ANother Tool for Language Recognition (ANTLR): https://www.antlr.org/index.html [accessed
2021-10-01]
¹⁶Often stylized as Wolfram|Alpha

Table 2.2: LaTeX to CAS translation comparison between Mathematica’s (MM) and SymPy’s (SP)
import functions, SnuggleTeX (ST) translation to Maxima, WolframAlpha (WA) interpretation of
LaTeX inputs, and the first version of LaCASt (LCT).


LaTeX | Rendering | MM | SP | ST | WA | LCT
Nalatie_ 217) 52 eb: J? xdx : x : YV x v : x 
\int_a^b x \mathrm{d}x | J? ada x | x x | Vv | x 
\int_a^b x\, dx | [xde Vv | YV x | Vv | x 
\int_a^b x\; dx I Sad Xv xv x 
\int_a^b x\, \mathrm{d}x | MEE x | x x | v | x 
\int_a^b \frac{dx}{x} | re x | vix | er M x 
\sum_{n=0}^N n^2 | sun Vv | Vi | Z | v 
\sum_{n=0}^N n^2 + n Eotn ir xin? 
{n \choose m} | E) x | x x | = | x 
\binom{n}{m} i v vi | Y | 2 
P_n^{(\alpha,\beta)}(\cos(a\Theta)) | PL) (cos(a®)) | y | ves | 7 | z 
\cos(a\Theta) | cos(aO) v | Viv | Vv | Vv 
\frac{d}{dx} \sin(x) | £ sin(x) x | v x | v | x 


Since LaTeX can be easily extended with new content via macros, some projects try to semanti-
cally enhance LaTeX with unambiguous commands. The two most comprehensive projects are
semantic LaTeX and sTeX.


2.2.2.2 Semantic/Content LaTeX 


The Jacobi polynomial in LaTeX and semantic LaTeX


1 P_n^{(\alpha, \beta)}(x) % Generic LaTeX
2 \JacobipolyP{n}{\alpha}{\beta}@{x} % Semantic LaTeX


Listing 2.2: The Jacobi polynomial in LaTeX (line 1) and semantic LaTeX (line 2).


Semantic LaTeX (also known as content LaTeX) was developed by Bruce Miller [260] at the Na-
tional Institute of Standards and Technology (NIST) to semantically enhance the equations in
the DLMF [403]. Essentially, semantic LaTeX is a set of custom LaTeX macros which are linked
to unique definitions in the DLMF. Consider for example the Jacobi polynomial in Listing 2.2.
The general LaTeX expression does not contain any information linked to the Jacobi polynomial.
However, semantic LaTeX replaces the general expression with a new macro \JacobipolyP
which is linked to the DLMF [98, (18.3#T1.t1.r2)]¹⁷. In addition, all variable arguments (parame-


¹⁷Hereafter, we refer to specific equations in the DLMF by their labels. The label can be added to the base
URL of the DLMF. For example, the sine function is defined at 4.14.E1, which can be reached via
https://dlmf.nist.gov/4.14.E1 [accessed 2021-10-01].
ters and variables) are separated and ordered following the function command. This separation
is essential to disambiguate notations. For example, the sine function is sometimes written with-
out parentheses, such as $\sin x$, resulting in semantically ambiguous notations, such as $\sin x + y$.
The semantic LaTeX macros allow this expression to be rendered in the same way while encoding it unambiguously via
\sin@@{x+y} (which is rendered as $\sin x + y$). Originally, semantic LaTeX helped to develop
a reliable search engine for the DLMF [260]. Nowadays, the macros are also in use in other
projects and have even been extended for the Digital Repository of Mathematical Formulae
(DRMF) [77, 78], an outgrowth of the DLMF.


Semantic LaTeX will play a crucial role in the rest of this thesis because it allows us to stick
with the easily maintainable syntax of LaTeX but semantically elevates the information of math
expressions to a level that can be exploited for translations towards CAS [3, 8, 13]. The main
reason is that the semantic LaTeX macros mostly cover OPSF from the DLMF. OPSF are a set of
functions and polynomials which are generally considered important, such as the trigono-
metric functions (also categorized as elementary functions), the Beta function, or orthogonal
polynomials. Most OPSF have more or less well-established names and standard notations. The
DLMF (i.e., especially the original book [276]) is considered a standard reference for OPSF [381].
General-purpose CAS, such as Mathematica and Maple, also focus on the comprehensive sup-
port of OPSF [381]. Hence, semantic LaTeX macros play a crucial role for translations from LaTeX
to CAS syntaxes. Since CAS syntaxes are programming languages, CAS can be extended with
new code. However, translating new math formulae to CAS can become arbitrarily complex.
Suppose the prime counting function were not supported by Mathematica. In this case,
$\pi(x + y)$ could not be translated to a simple mathematical formula in the syntax of Mathematica
but would require entirely new subroutines. Therefore, a comprehensive, viable, and reliable
translator from LaTeX to the syntax of CAS should maximize its support for OPSF in order to be
useful.
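
To give an impression of what such a subroutine involves, the following Python sketch implements the prime counting function with a sieve of Eratosthenes. It merely illustrates the hypothetical scenario above (Mathematica does provide this function as `PrimePi`); the function name `prime_pi` is our own choice, and the sieve is one simple option among several algorithms.

```python
import math


def prime_pi(x: float) -> int:
    """Count the primes p with p <= x using the sieve of Eratosthenes."""
    n = math.floor(x)
    if n < 2:
        return 0
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, math.isqrt(n) + 1):
        if is_prime[p]:
            # Mark all multiples of p, starting at p * p, as composite.
            for multiple in range(p * p, n + 1, p):
                is_prime[multiple] = False
    return sum(is_prime)


x, y = 3, 4
print(prime_pi(x + y))
```

Even this short routine is no longer a formula but a program, which is why a translator benefits from mapping to functions the CAS already supports.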


Definition 2.1 provides a brief definition for the elements of a semantic macro. While the
semantic source of the DLMF is publicly available [403], the actual definitions of the macros, i.e., the LaTeX
style files, are still private¹⁸. B. Miller provided access to the definitions of
the macros for this thesis. Later in this thesis, we will rely on additional meta-information
given for each semantic macro. This includes default parameters and variables, a short textual 
description, and links to the DLMF CD [258]. Further information is not explicitly given in 
the macro definition files. For example, function constraints, domains, branch cut positions, 
singularities, and other properties are only given in the DLMF. 


As previously mentioned, we¹⁹ developed LaCASt for translating semantic LaTeX DLMF formulae
to CAS [3, 13]. The first version did not contain any disambiguation steps or pattern matching
approaches to deduce the intended meaning of an expression. Instead, it fully relied on the
semantic LaTeX macros to perform translations to Maple. For example, sums or products were
not supported directly but required the semantically enhanced macros from the DRMF [77,
78]. The source of LaCASt is not yet publicly available²⁰ due to the dependency on the POM
tagger [402] and the semantic LaTeX macros [260, 403] but is accessible via open API endpoints²¹.


¹⁸As of 2021-10-01.

¹⁹The first version of LaCASt was the subject of my Master’s thesis and laid the foundation for a reliable translation
from semantic LaTeX to multiple CAS.

²⁰As of 2021-10-01.

*1The API contains a Swagger UI and is reachable at https : //vmext-demo. formulasearchengine . com 
[accessed 2021-10-01]. BCasT is available under math/translation path (in the math controller). The experimental 
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® Definition 2.1: The elements of a semantic macro 


A semantic LaTeX macro is a LaTeX macro with a unique name followed by a number
of arguments. Some of the following elements are optional, but their order
remains the same. While a caret and primes may appear in either order, each order has
a different meaning, as can be seen in the examples below.


A semantic macro and its arguments:


\macro      The unique semantic macro name with a backslash
[optPar]    An optional parameter in square brackets
{par}       Parameters in curly brackets
' or ^      Optional prime symbols or a caret for power notations
@           A number of @ symbols to control the visualization of the macro
{var}       Variables in curly brackets

Examples:


\sin@{x}                            →  $\sin(x)$
\sin@@{x}                           →  $\sin x$
\BesselJ{\nu}''^2@{z}               →  $J_\nu''^{2}(z)$
\BesselJ{\nu}^2''@{z}               →  $J_\nu^{2}{}''(z)$
\genhyperF{2}{1}@{a,b}{c}{z}        →  ${}_2F_1(a, b; c; z)$
\genhyperF{2}{1}@@{a,b}{c}{z}       →  ${}_2F_1\left({a, b \atop c}; z\right)$
\genhyperF{2}{1}@@@{a,b}{c}{z}      →  ${}_2F_1(z)$
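
The element order of Definition 2.1 can be illustrated with a small parser. The following is a minimal sketch and not the parser used by LaCASt: the regular expression, the function name parse_semantic_macro, and the returned dictionary keys are assumptions made for this illustration, and nested braces inside arguments are deliberately not supported.

```python
import re

# Hypothetical building blocks (assumptions for this sketch only).
GROUP = r"\{[^{}]*\}"                      # one {...} argument without nesting
MARK = r"'+|\^" + GROUP + r"|\^[^{}\s]"    # primes, ^{...}, or ^x

# Elements in the order of Definition 2.1.
SEMANTIC_MACRO = re.compile(
    r"\\(?P<name>[A-Za-z]+)"               # \macro
    r"(?:\[(?P<opt>[^\]]*)\])?"            # [optPar]
    r"(?P<pars>(?:" + GROUP + r")*)"       # {par}...
    r"(?P<marks>(?:" + MARK + r")*)"       # primes and/or caret, order kept
    r"(?P<ats>@*)"                         # @, @@, @@@
    r"(?P<vars>(?:" + GROUP + r")*)"       # {var}...
)


def split_groups(text):
    """Return the contents of consecutive {...} groups."""
    return re.findall(r"\{([^{}]*)\}", text)


def parse_semantic_macro(source):
    """Split one semantic macro call into the elements of Definition 2.1."""
    match = SEMANTIC_MACRO.fullmatch(source.strip())
    if match is None:
        raise ValueError("not a semantic macro call: %r" % source)
    return {
        "name": match["name"],
        "optional_parameter": match["opt"],
        "parameters": split_groups(match["pars"]),
        "marks": re.findall(MARK, match["marks"]),  # order is significant
        "at_count": len(match["ats"]),
        "variables": split_groups(match["vars"]),
    }


print(parse_semantic_macro(r"\BesselJ{\nu}''^2@{z}"))
print(parse_semantic_macro(r"\genhyperF{2}{1}@@{a,b}{c}{z}"))
```

Because the marks are returned as an ordered list, the two Bessel examples of Definition 2.1 remain distinguishable after parsing. Note that the sketch relies on at least one @ symbol to separate parameters from variables.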


Apart from LaCASt, LaTeXML [257] is another tool that supports semantic LaTeX and provides
conversions to LaTeX, MathML, and a variety of image formats. LaTeXML was also developed by
B. Miller with the original goal of supporting the development of the DLMF [133]. LaTeXML is a
general LaTeX to XML converter. However, in order to support the development of the DLMF, LaTeXML
is able to fully load semantic LaTeX definition files to convert semantic LaTeX into semantically
appropriate content MathML. With this ability, LaTeXML is generally capable of converting other
LaTeX encodings too, such as sTeX, which is discussed next.


2.2.2.3 sTeX 


sTeX refers to semantic TeX and should not be confused with B. Miller’s semantic LaTeX. sTeX was
developed around 2008 [194, 195, 200] with the goal of semantically annotating LaTeX documents
with semantic macros. Specifically, sTeX should serve as a source format to generate the semantic
document format OMDoc. While the underlying motivation and technical solution of sTeX and
semantic LaTeX are very similar, there are some core differences between both formats. Semantic
LaTeX was developed specifically for the DLMF and, therefore, provides semantic macros for
OPSF. In particular, a semantic macro in the DLMF represents a specific unique function. In
turn, sTeX aims to cover general mathematical notations and provides a logic to semantically
annotate general functions and symbols. Consider the aforementioned example π(x + y). If
π refers to the prime counting function, we can resolve the ambiguity with semantic
LaTeX via \nprimes@{x+y} since the semantic macro \nprimes refers to that function.


flag performs pattern matching approaches described later in this thesis. The label allows specifying a DLMF equation
label to apply equation-specific assumptions (e.g., that i is an index and not the imaginary unit).

In sTeX, an author can use modules and IDs to define the function and set the notation via
\symdef{\pi}[1]{\prefix{\pi}{#1}}. While this makes the interpretation of π(x + y)
unambiguous, an underlying definition is still missing. Hence, sTeX provides the option to link
symbols with their definitions in the document. This definition linking underlines the original
motivation and connection to the semantic document format OMDoc.


Since sTeX is not limited to specific domains, we could define any notation we want in our
semantic document. On the other hand, this generalizability of sTeX makes the format more
verbose and somewhat similar to a programming language. In sTeX, we need to define and
declare symbols explicitly. In addition, a newly defined symbol still needs to be manually linked
to an underlying definition. In semantic LaTeX, the macro itself is linked to the appropriate
definition in the DLMF. sTeX provides access to predefined sets of macros that aim to cover K-14
mathematics [195].


In conclusion, sTeX is flexible but verbose. The format is useful when it comes to annotating
a general mathematical document semantically. However, the strength of sTeX, for example,
the ability to define any symbol with specific semantics, is generally not very important for
translations to CAS. CAS have a fixed set of supported functions and often try to mimic common
notation styles, e.g., one does not need to define − as a unary prefix operator in −2. In turn, a
translation from LaTeX to CAS faces the issue of identifying the names of the functions involved,
their arguments, and the appropriate mappings to counterparts in CAS syntax. Semantic LaTeX,
on the other hand, provides a syntax that makes it easy to solve these issues. The name of the
function is directly encoded in the name of the macro, the arguments are explicitly declared
and distinguishable (by curly brackets), and a mapping to an appropriate counterpart in the
CAS can be more easily found due to the large overlap of functions in the DLMF and supported
functions in CAS.


As previously mentioned, LaTeXML [257] is able to load TeX definition files and supports conversion
to XML encodings. Hence, LaTeXML can transform sTeX expressions to content MathML [200].
The ability to link sTeX symbols with their definitions in a document or external source further
makes it a source for generating entire semantically enhanced OMDoc documents [195]. sTeX
could also be used as an alternative to semantic LaTeX for translations to CAS. However, due to
the natural overlap of functions in the DLMF and CAS, at some point in the development of a
translation process based on sTeX, we would create semantically enhanced macros for OPSF similar to
the existing semantic LaTeX macros. Hence, using sTeX instead of semantic LaTeX offers no
direct advantage for translations towards CAS. The higher flexibility of sTeX makes it
a good candidate for translations beyond OPSF.


2.2.2.4 Template Editors 


Since LaTeX is an interpretable language with over ten thousand mathematical symbols
alone [280], learning LaTeX syntax is often simply too time-consuming and complex for many
users. To provide easier access to rendered mathematics, especially in so-called what you see
is what you get (WYSIWYG) editors, such as Microsoft’s Office programs²² or Wikipedia’s visual
article editor²³, template editors have become the norm. Template editors provide visual templates


²² https://support.microsoft.com/en-us/office/equation-editor-6eac7d71-3c74-437b-80d3-c7dea24fdf3f [accessed 2021-10-01]

²³ The Wikipedia article about formula editors (https://en.wikipedia.org/wiki/Formula_editor)
[accessed 2021-10-01]
Figure 2.2: The math template editor of Microsoft’s Word [395].


of standard mathematical notations so that the user only needs to fill in the remaining spaces.
Figure 2.2 shows a snippet of the sum templates in the template editor of Microsoft’s Word [395].
Modern graphical interfaces of CAS also often contain such template editors to improve
the user experience further. In comparison to LaTeX, template editors are generally easier to use
but limited to the offered templates. Hence, for more complex expressions, template editors
are often described as confining [273]. Template editors do not introduce a new math format.
The editors only provide a different input method but encode the mathematical formulae in
system-specific formats, such as MathML in Microsoft’s Word or Maple syntax in Maple.


2.2.3 Computable Formats 


So far, we have covered the major formats that focus on the presentation of mathematical
expressions and formats that capture the semantics. Even though formats like content
MathML, OpenMath, and the semantic LaTeX extensions can resolve the ambiguity of math
formulae, they are not computable formats, i.e., we cannot perform actual calculations and
computations on them. The syntax of a computable format is a formal language in which every
word is linked to specific subroutines. Much like programming languages, computable formats
are semantically unambiguous and interpretable. In turn, computable formats are generally
part of a larger software package that ships an interpreter to parse inputs and an engine that
performs the computations. In the following, we briefly discuss CAS and theorem prover
formats as examples of computable formats. We will not specifically focus on math packages
for specific programming languages, such as C++ [168], Python [252], or Java [79]. Most CAS
and theorem provers, however, internally rely on those lower-level packages to some degree.


2.2.3.1 Computer Algebra Systems 


A CAS is mathematical software that can perform a variety of mathematical operations on
math inputs, such as symbolic manipulations, numeric calculations, plotting and visualization,
simplification, and many more [76, 81, 128, 413]. With the increasing power of computers, CAS
became a crucial part of the modern scientific world [32, 262, 352, 356] and are widely used
for mathematical problem solving [49, 51, 127, 216, 414], simulations [46, 142, 166, 265, 294],
symbolic manipulations [115, 325], and even for teaching students from schools to universities
[158, 237, 244, 350, 363, 365, 389, 390]. Due to their complexity, CAS are often large and
expensive proprietary software packages [36, 164, 393]. However, there are several well-known open
source options available [42], such as SymPy [252], Axiom²⁴ [176], and Reduce²⁵ [151]. Many
CAS focus on specific domains or mathematical tasks, such as Cadabra [289, 290, 291] (tensor
field theory), FORM [372] (particle physics), GAP [177] (group theory and combinatorics),
PARI/GP [283] (number theory), or MATLAB [164] (primarily for numeric computation). In contrast,
general-purpose CAS, including Mathematica [393], Maple [36], Axiom [176], SymPy [178,
252], Maxima [264, 324], or Reduce [151], aim to provide a large set of tools and algorithms that
are beneficial for many mathematical applications. Consequently, general-purpose CAS support
a large number of OPSF, since these functions and polynomials are used in a large variety
of scientific fields, from pure and applied mathematics to physics and engineering.
Therefore, we primarily focus on translations to general-purpose CAS in this thesis rather than
to domain-specific CAS.


The input formats of general-purpose CAS are often multi-paradigm programming
languages [88], i.e., they combine multiple standard programming features, such as functional,
mathematical, and procedural approaches. Major CAS generally use their own input language,
such as the Wolfram Language in Mathematica [392]. Like any programming language,
the input format must be unambiguous to the underlying parser of the CAS so that every
keyword is uniquely linked to subroutines in the CAS engine. This link to a subroutine makes
the expression computable. In contrast, the semantic LaTeX macros are linked to theoretical
mathematical concepts defined in the DLMF but not to specific implementations. Hence, a
translation to a CAS syntax requires linking mathematical notations, e.g., Γ(z), that refer to
specific mathematical concepts, e.g., the Gamma function, to the correct sequence of keywords
in the CAS, e.g., GAMMA(z) in Maple.
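
The linking step described above can be pictured as a lookup from a mathematical concept to the keyword of each target system. The following minimal sketch is not the translation table of LaCASt: the table layout, the function name translate, and the use of the macro name EulerGamma as key are assumptions for illustration. Only the target spellings GAMMA(z) for Maple, Gamma[z] for Mathematica, and gamma(z) for SymPy are the keywords of the respective systems.

```python
# Illustrative lookup table (assumed layout): one concept, several CAS spellings.
CAS_TEMPLATES = {
    "EulerGamma": {                 # assumed key for the Gamma function macro
        "Maple": "GAMMA({0})",
        "Mathematica": "Gamma[{0}]",
        "SymPy": "gamma({0})",
    },
}


def translate(macro, argument, cas):
    """Fill the CAS-specific template of a semantic macro with its argument."""
    try:
        template = CAS_TEMPLATES[macro][cas]
    except KeyError:
        raise LookupError("no translation known for %r in %s" % (macro, cas)) from None
    return template.format(argument)


for system in ("Maple", "Mathematica", "SymPy"):
    print(system, "->", translate("EulerGamma", "z", system))
```

A plain table like this only works when the concept has a direct counterpart in the target system; a missing entry surfaces as a LookupError instead of a silently wrong translation.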


Since computable languages naturally encode the highest level of semantic information in
their expressions, a translation towards other systems that encode less semantic information
is possible with a comprehensive list of simple mapping rules. Many CAS therefore provide
a variety of different output formats, from LaTeX to MathML (including content MathML) and
images. Translations between CAS or other mathematical software, such as theorem provers,
require more sophisticated mappings due to system-specific implementations [110]. From 2006
to 2011, a joint research project funded by the European Union with over 3 million euros
aimed to improve the symbolic computation infrastructure for Europe²⁶. The result of the
SCIEnce project was the Symbolic Computation Software Composability Protocol (SCSCP) [119,
361], which uses the OpenMath encoding to transfer mathematical expressions. Using the
SCSCP, interfaces for GAP [206], KANT [120], Maple [243], MuPAD [155], Mathematica [44],
and Macaulay2 [311] were implemented.


Note that there are solutions available that do not require any translation between LaTeX and
CAS. For example, the CAS syntax of Cadabra [291] is a subset of TeX itself. Similarly, SageTeX²⁷
is a LaTeX package that allows authors to enter SageMath [317] expressions into LaTeX documents,
turning the document into an interactive document [201] to some degree. SageMath is a
general-purpose CAS that relies on existing solutions for domain-specific tasks, such as GAP [177] for
group theory or PARI/GP [283] for number theory problems. These solutions do not require


²⁴ Open source since 2001 (first released in 1965).

²⁵ Open source since 2008 (first released in 1963).

²⁶ EU FP6 project 026133: https://cordis.europa.eu/project/id/26133/ [accessed 2021-10-01]
²⁷ https://doc.sagemath.org/html/en/tutorial/sagetex.html [accessed 2021-10-01]
translations since the input must be provided in the syntax of the CAS. Hence, a translation
must be performed manually or via external tools.


In the introduction, we mentioned potential issues of CAS with multi-valued functions.
Multi-valued functions map values from a domain to multiple values in a codomain and frequently
appear in the complex analysis of elementary and special functions [8]. Prominent examples
are the inverse trigonometric functions, the complex logarithm, or the square root. All modern
CAS²⁸ compute multi-valued functions on their principal branches, which makes these functions
effectively single-valued (e.g., a calculator always returns 2 for √4 rather than ±2 or just −2).
Properties that hold for multi-valued functions on the complex plane may no longer hold for
their counterparts in CAS, e.g., (1/z)^w = 1/(z^w) for z, w ∈ ℂ and z ≠ 0 is no longer
valid within CAS. The positioning and handling of branch cuts in CAS is often discussed in
scientific articles and generally prominently noted in CAS handbooks [83, 84, 91, 108, 171, 172].
However, especially in more complex scenarios, it is easy to lose track of branch cut positioning
and evaluate expressions on incorrect values. We provide a more complex example and a more
detailed explanation of branch cuts in Appendix A available in the electronic supplementary
material. To the best of our knowledge, no available translation tool from, to, or between CAS
(including the SCSCP solutions) considers branch cut positions.
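
The effect of principal branches on the identity (1/z)^w = 1/(z^w) can be reproduced with any numeric library that evaluates the complex logarithm on its principal branch. The following minimal sketch uses Python's standard cmath module as a stand-in for a CAS. The helper names principal_pow and identity_gap are assumptions for this illustration, and the sample points were chosen by us.

```python
import cmath


def principal_pow(z, w):
    """z**w on the principal branch: exp(w * Log z) with Arg z in (-pi, pi]."""
    return cmath.exp(w * cmath.log(z))


def identity_gap(z, w):
    """Distance between (1/z)**w and 1/(z**w), both on the principal branch."""
    return abs(principal_pow(1 / z, w) - 1 / principal_pow(z, w))


# z is passed as a real float on purpose: its imaginary part is then +0.0,
# so the point is treated as lying on the upper side of the branch cut.
off_cut = identity_gap(1 + 1j, 0.5)   # away from the negative real axis
on_cut = identity_gap(-4.0, 0.5)      # on the branch cut of the logarithm

print("gap off the cut:", off_cut)
print("gap on the cut: ", on_cut)
```

Away from the negative real axis the gap vanishes up to rounding. For z = −4 and w = 1/2 the left-hand side evaluates to approximately i/2 and the right-hand side to approximately −i/2, so the two sides differ by about 1.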


2.2.3.2 Theorem Prover 


The idea of automated reasoning and deduction systems is as old as computers [147]. With
the power of computers and a strict axiomatic approach as in Principia Mathematica [385],
computers can perform automatic reasoning steps to discover and prove new mathematical
theorems. To this day, automated theorem proving and verification is an extensive research
area with an ever-growing interest [266, 354, 384]. There are numerous theorem provers and
proof assistant systems available, such as HOL Light [146], HOL4 [340], or Isabelle [287].
However, because they focus on deduction, the encoding of theorem provers generally goes beyond
mathematical expressions. The syntax provides specific options for assumptions, links between
multiple concepts, and logical steps. An example of a proof in Isabelle, which clearly visualizes
the different notations of theorem provers and CAS, is given in Appendix C available in the
electronic supplementary material.


Nonetheless, theorem prover formats are computable formats with specific mathematical
applications. Hence, there is a genuine interest in transferring findings and solutions from one
system to the other. There are some translation approaches between theorem provers and CAS
available, from direct translations [28, 148] to translations over OpenMath [57, 338] and
OMDoc [152]. Theorem provers are generally unable to compute a single mathematical formula
in the sense of numeric computations or symbolic manipulations. Hence, we do not choose
theorem provers as the target computable format for our desired translation process.


2.2.4 Images and Tree Representations 


In the following, we briefly discuss formats with a specific visualization focus: images and
tree representations. Especially older literature is often only available in digital scans, and many
copies of publications do not provide access to the original LaTeX source. Images can be
considered the purest presentational format of mathematical expressions. Tree representations
of math expressions, on the other hand, are more theoretical concepts to visualize the logical
or presentational structure of math. Tree representations are primarily used for explanation
purposes to underline or visualize an idea or concept. Parse trees, as a specific tree
format generated from mathematical string inputs, on the other hand, play a crucial role in almost every
mathematical software tool. Often, digital mathematical formats try to mimic the logical tree
structure of math expressions. This is also one of the reasons why the web formats (MathML
and OpenMath) use XML to encode mathematical content.


²⁸ The authors are not aware of any example of a CAS which treats multi-valued functions without adopting
principal branches.


Symbolic Layout, Operator, Parse, and Expression Trees Mathematical expressions are
often represented in tree structures. For example, MathML itself is an XML tree data structure.
Moreover, mathematicians often have a logical but theoretical tree representation of a formula
in mind in which numbers and identifiers are terminal symbols (leaves) and children of math
operators, functions, and relations [192, 331]. These so-called expression trees are more or less
a theoretical structure and are mainly used to visualize logical correlations and connections
in mathematical expressions. Schubotz et al. [331] attempted to automate the visualization
process of expression trees based on cross-referenced MathML data which resulted in VMEXT,
a visualization tool for MathML. Figure 2.3 shows a possible expression tree visualization for
the Jacobi polynomial definition in terms of the hypergeometric function.


Figure 2.3: An expression tree representation of the explicit Jacobi polynomial definition in
terms of the hypergeometric function.
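
For reference, the standard representation of the Jacobi polynomial in terms of the Gauss hypergeometric function is stated below. We assume that this is the formula depicted in Figure 2.3. It is reproduced here from the usual definition with the Pochhammer symbol $(\alpha+1)_n$ and is not recovered from the figure itself.

```latex
P_n^{(\alpha,\beta)}(z)
  = \frac{(\alpha+1)_n}{n!}\,
    {}_2F_1\!\left(-n,\; 1+\alpha+\beta+n;\; \alpha+1;\; \tfrac{1}{2}(1-z)\right)
```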


For visualization and education purposes, these tree representations can be beneficial. However,
generating these trees requires a deep understanding of the logical structure of the expression. In
addition, there is no exact definition available for expression trees. Hence, the exact visualization
is often up for discussion, e.g., whether parameters are children similar to variables or part of
the function node itself [9]. The lack of a standard definition makes expression trees unreliable
and, therefore, less practical for a mathematical encoding.


Parse Trees Parse trees are generated tree representations of source expressions (strings).
These trees are generated by a parser that follows a strict set of rules, e.g., a context-free
grammar [101, 188, 298]. Mathematical LaTeX (as a subset of TeX) can, under a few simplifications
(e.g., no redefined standard literals and macros), also be described by a context-free
grammar [402] even though TeX itself is Turing complete [133, 135, 187]. The POM tagger [402],
for example, parses mathematical LaTeX following a context-free grammar. Similarly, Chien
and Cheng [71] built a custom context-free grammar parser for their semantic tokenization
of mathematical LaTeX expressions. LaTeXML follows the more sophisticated TeX-like digestion
methods [187] to parse entire TeX files [133, 135]. CAS inputs are parsed internally for further
processing [138, 392]. Maple’s internal parser also generates a parse tree in which equivalent
nodes are merged together for more efficient memory usage (mathematically speaking, this
data structure is no longer a valid tree but instead a directed acyclic graph, or simply DAG) [3,
13].
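
Merging equivalent nodes so that a tree becomes a DAG can be illustrated with a small node pool. The following minimal sketch shows the general technique of sharing identical subexpressions and does not reflect Maple's actual internal data structures. The class name ExpressionPool and its interface are assumptions for this illustration.

```python
class ExpressionPool:
    """Stores each distinct (label, children) node once, so equal
    subexpressions are shared and the tree becomes a DAG."""

    def __init__(self):
        self._ids = {}    # (label, child ids) -> node id
        self.nodes = []   # node id -> (label, child ids)

    def node(self, label, *children):
        """Return the id of the node, creating it only if it is new."""
        key = (label, children)
        if key not in self._ids:
            self._ids[key] = len(self.nodes)
            self.nodes.append(key)
        return self._ids[key]


pool = ExpressionPool()
x = pool.node("x")
one = pool.node("1")
# (x + 1) * (x + 1): both factors resolve to the same stored node.
left = pool.node("+", x, one)
right = pool.node("+", x, one)
product = pool.node("*", left, right)

print("shared factor:", left == right)
print("stored nodes: ", len(pool.nodes))
```

A plain tree for (x + 1) · (x + 1) has seven nodes, whereas the pool stores only four because the repeated factor and its leaves are shared.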


In contrast to theoretical tree representations, such as the mentioned expression trees, parse
trees are crucial for many applications because a tree data format is easier to process due
to its structural logic [93, 242, 286, 406]. While string sequences of commands may contain
ambiguities, tree data structures are unique and provide easy access to single logical nodes,
groups of nodes, and their dependencies. Hence, parsing a mathematical input (such as
CAS inputs or LaTeX expressions) is typically the first step in any processing pipeline. Later
in this thesis, we will also take advantage of tree representations by defining a translation
between math formats as graph transformations on their tree representations. To generate a
tree representation of mathematical LaTeX formats, we can either build a custom parser [71]
or rely on existing parsers, such as LaTeXML [257] or the POM tagger [402]. Parse trees (and
other custom tree formats that are generated by analyzing a given input) can also
be categorized into symbol layout trees (for presentational formats) and operator trees (for
content/semantic formats) [406]. For example, parsing LaTeX may result in a symbol layout tree
that describes the visual structure of formulae while parsing semantic LaTeX (or CAS inputs)
may result in operator trees which describe the logical mathematical structure of the input.
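
An operator tree for a computable input can be obtained with any parser of a formal language. The following minimal sketch uses Python's own ast module as a stand-in for a CAS input parser, which is an assumption made purely for illustration. The function name operator_tree and the nested-tuple output format are likewise our own choices.

```python
import ast


def operator_tree(expression):
    """Parse an arithmetic expression in Python syntax and return its
    operator tree as nested tuples of the form (operator, operand, ...)."""

    def convert(node):
        if isinstance(node, ast.BinOp):
            return (type(node.op).__name__, convert(node.left), convert(node.right))
        if isinstance(node, ast.UnaryOp):
            return (type(node.op).__name__, convert(node.operand))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            return (node.func.id, *[convert(arg) for arg in node.args])
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported syntax: " + type(node).__name__)

    return convert(ast.parse(expression, mode="eval").body)


tree = operator_tree("sin(x)**2 + cos(x)**2")
print(tree)
```

The resulting tuples place operators and function names at inner nodes and identifiers and numbers at the leaves, which is the logical structure an operator tree is meant to expose. A symbol layout tree of the same formula would instead record baselines, superscripts, and horizontal adjacency.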


Images From pixel graphics (e.g., JPEG or PNG) to vector graphics (e.g., Scalable Vector
Graphics (SVG)) and document formats (e.g., PDF), mathematical expressions can appear in
a variety of different image formats. The two-dimensional structure of mathematics makes
drawing mathematical formulae on a sheet of paper or touch screens the most intuitive input
method for mathematics. In addition, with rising digitization, scans of old scientific articles are
no longer the only source of math images. Handwriting systems are increasingly adopted in
offices and educational institutions [411]. In 2016, Wikipedia switched from non-scalable PNG
images to vector graphics for visualizing mathematics [17] (see Appendix B available in the
electronic supplementary material for a more detailed overview of the history of math
formulae in Wikipedia).


However, image formats are not directly interpretable and are, therefore, less machine-readable.
Hence, the first step of analyzing mathematics in images is always a conversion into a more
machine-readable, digital format. The majority of conversion approaches, including
handwriting recognition and Optical Character Recognition (OCR), focus on translations to MathML
or LaTeX [373, 406, 411]. Hence, for our task (translating presentational formats to computable
formats), starting with image formats is not practically useful.


Nonetheless, one particular issue in math OCR is also of interest for our translation task:
detection of inline mathematics. In image formats, detecting inline mathematics is difficult
because formulae may blend into texts [74, 125, 126, 230, 398]. Even detecting italic fonts
can be a challenging task [66, 112, 113, 233]. A variable can easily be confused with words,
such as the Latin letter ‘a’. A similar issue arises in other formats, including LaTeX documents
and Wikipedia articles, when an author does not correctly annotate mathematical formulae.
In Wikipedia, for example, single identifiers in a text are often put in italic font rather than in
mathematical environments. The ability to use UTF-8 encodings encourages Wikipedia editors
to put inline mathematics into the text directly, even when special characters are involved.
For example, the mathematical expression 0 ≤ φ ≤ 4π in the English Wikipedia article about
Jacobi polynomials²⁹ is a sequence of UTF-8 characters and thus challenging for MathIR
parsers to identify as mathematics. Nevertheless, identifying all mathematical expressions in a
document might be necessary for more reliable translations towards computable formats. For
example, the mentioned relation of φ defines the domain of the Wigner d-matrix and is of
interest for automatic evaluations (see Chapter 5).
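
To illustrate why unannotated UTF-8 mathematics is hard to spot, the following minimal sketch implements a deliberately naive heuristic based on Unicode character properties. It is not a method used in this thesis: the function names, the choice of the Unicode category Sm and Greek letters as indicators, and the sample sentence are assumptions for this illustration.

```python
import unicodedata


def is_math_char(char):
    """Heuristic: math symbols (Unicode category Sm) and Greek letters."""
    return unicodedata.category(char) == "Sm" or "GREEK" in unicodedata.name(char, "")


def find_inline_math(text):
    """Return runs of neighbouring tokens that consist of digits or math
    characters and contain at least one math character."""
    runs, current = [], []
    for raw in text.split() + [""]:          # empty sentinel flushes the last run
        token = raw.strip("()[]{},.;:")
        mathy = token != "" and any(is_math_char(c) or c.isdigit() for c in token)
        if mathy:
            current.append(token)
            continue
        if any(is_math_char(c) for t in current for c in t):
            runs.append(" ".join(current))
        current = []
    return runs


sample = "the Wigner d-matrix (for 0 ≤ φ ≤ 4π) is of interest"
print(find_inline_math(sample))
```

The heuristic finds the relation in the sample sentence but would miss single Latin identifiers such as ‘a’, which is exactly the ambiguity described above.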


2.2.5 Math Embeddings 


Word embedding techniques have received significant attention over the last years in the Natural
Language Processing (NLP) community, especially after the publication of word2vec [256].
Therefore, more and more projects try to adapt this knowledge for solving tasks in the MathIR
arena [15, 121, 141, 215, 353, 360, 400, 404]. These projects try to embed math expressions into
natural languages to create a vector representation of the formula. A vector representation is
the data format with the highest machine readability among all representations of
mathematical formulae. Math embeddings successfully enabled a new approach to measure the
similarity between math expressions, which is especially useful for math search, classification,
and similar tasks [121, 215, 400, 404].


Considering the equation embedding techniques in [215], we distinguish three main types of
mathematical embedding: Mathematical Expressions as Single Tokens, Stream of Tokens, and Semantic
Groups of Tokens. In the following, we briefly explain each type using an example expression
containing the inequality for Van der Waerden numbers


$W(2,k) > 2^k / k^{\varepsilon}.$ (2.1)


This expression is the first entry in the MathML benchmark [18] that we explain in
detail in Section 2.3.


Mathematical Expressions as Single Tokens So-called equation embeddings (EqEmb)
were introduced by Krstovski and Blei [215] and use an entire mathematical expression as one
token. In a one-token representation, the inner structure of the mathematical expression is not
considered. For example, W(r, k) is represented as one single token t₁. Any other expression,
such as W(2, k) in the context, is an entirely independent token t₂. Therefore, this approach
does not learn any connections between W(2, k) and W(r, k). However, [215] has shown
promising results for comparing mathematical expressions with this approach.


Stream of Tokens As an alternative to embedding mathematical expressions as a single
token, one can also represent an expression through a sequence of its inner elements. For
example, considering only the identifiers in Equation (2.1), it would generate W, k, and ε as a
sequence/stream of tokens. This approach has the advantage of learning all mathematical tokens.
However, this method also has some drawbacks. Complex mathematical expressions may lead
to long chains of elements, which can be especially problematic when the window size of the
training model is too small. Naturally, there are approaches to reduce the length of chains. Gao et
al. [121] use a continuous bag of words (CBOW) approach and embed all mathematical symbols,
including identifiers and operators, such as +, − or variations of equalities =. Krstovski and
Blei [215] also evaluated the stream of tokens approach but do not cut out symbols. They trained
their model on the entire sequence of tokens that the LaTeX tokenizer generates. Considering
Equation (2.1), it would result in a stream of 13 tokens. They use a long short-term memory
(LSTM) architecture to overcome the limiting window size and further limit chain lengths to
20 to 150 tokens. Usually, in word embedding, such behaviour is not preferred since it increases
the noise in the data.


²⁹ https://en.wikipedia.org/wiki/Jacobi_polynomials#Applications [accessed 2021-10-01]


We [15] also use this stream of tokens approach to train our model on the DLMF without any
filters. Thus, Equation (2.1) generates all 13 tokens. Later in Section 3.1, we show another model
trained on the arXiv collection, which uses a stream of mathematical identifiers and cuts out
all other expressions, i.e., in case of (2.1), we embed W, k, and ε. We presume this approach
is more appropriate to learn connections between identifiers and their definiens. We will see
later that both of our models trained on math embeddings are able to detect similarities between
mathematical objects but do not perform well on detecting connections to word descriptors.
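
The difference between the token views discussed so far can be made concrete on Equation (2.1). The following minimal sketch only illustrates the idea and is not the tokenizer used in [15] or [215]. The LaTeX spelling of the formula, the regular expression, and the small set of Greek-letter commands are assumptions. Since the number of tokens depends on the tokenizer and on how the formula is written, the sketch is not expected to reproduce the 13 tokens mentioned above.

```python
import re

# Equation (2.1) in one possible LaTeX spelling (assumption).
FORMULA = r"W(2,k) > 2^k / k^\varepsilon"

# Small illustrative set of identifier commands (assumption).
GREEK_COMMANDS = {r"\varepsilon", r"\epsilon", r"\alpha", r"\beta"}


def single_token(formula):
    """Single-token view: the whole expression is one opaque vocabulary item."""
    return [formula]


def token_stream(formula):
    """Naive LaTeX tokenization: commands first, then single non-space characters."""
    return re.findall(r"\\[A-Za-z]+|\S", formula)


def identifier_stream(formula):
    """Keep only identifiers: Latin letters and known Greek-letter commands."""
    return [t for t in token_stream(formula) if t.isalpha() or t in GREEK_COMMANDS]


print(single_token(FORMULA))
print(token_stream(FORMULA))
print(identifier_stream(FORMULA))
```

In the single-token view, W(2, k) and W(r, k) would be unrelated vocabulary items. The identifier stream keeps repeated occurrences of k, whereas the text above lists only the distinct identifiers W, k, and ε.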


Semantic Groups of Tokens The third approach of embedding mathematics is only
theoretical. Current MathIR and Machine Learning (ML) approaches would benefit from a basic
structural knowledge of mathematical expressions, such that variations of function calls (e.g.,
W(r, k) and W(2, k)) can be recognized as the same function. Instead of defining a unified
standard, current techniques use their own ad-hoc interpretations of structural connections. We
assume that an embedding technique would benefit from a system that can detect the parts of
interest in mathematical expressions before any training process. However, such a system does
not exist yet. Later in Section 3.2, we will introduce a new concept to interpret logical groups
of mathematical objects that may enable a semantic embedding in the future.


It is important to mention that it remains unclear to what degree math semantic information
can be embedded in a vector representation [9]. Since there is no answer to this question, we
have not included math embeddings (i.e., vector representations of formulae) in Figure 2.1.
Nonetheless, a vector representation can be decoded into a CAS syntax representation again
to perform an ML-based translation [296]. We will elaborate on such an approach in
Chapter 4.


2.3 From Presentation to Content Languages 


We introduced several different formats for encoding mathematical formulae digitally and
provided an overview of several existing conversion tools between these formats. Considering
Figure 2.1, the goal of this thesis, i.e., making presentational math computable, requires
converting mathematical formats from the far left of the figure to the far right. We have
chosen LaTeX as the source format and general-purpose CAS syntaxes as the target formats.
Considering the merit of communicating knowledge in the sciences, it comes as no surprise that
there are numerous translation tools and theoretical approaches available to convert math
formulae between multiple formats, including our goal translation from LaTeX to CAS syntaxes.
Since MathML is the web standard which is supported by several CAS at least partially [57,
110, 303, 338] (or OpenMath respectively), a translation from LaTeX to CAS could be performed
over MathML (preferably content MathML). In this section, we analyze state-of-the-art LaTeX to
MathML converters to study the applicability of using MathML as an intermediate format for
translations from LaTeX to CAS syntaxes. This section was previously published [18].


2.3.1 Background 


In the following, we use the Riemann hypothesis (2.2) as an example to explain typical challenges 
of converting different representation formats of mathematical formulae: 


\zeta(s) = 0 \Rightarrow \Re s = \frac{1}{2} \lor \Im s = 0. (2.2) 


We will focus on the representation of the formula in LaTeX and in the format of the CAS 
Mathematica. LaTeX is a common language for encoding the presentation of mathematical 
formulae. In contrast to LaTeX, Mathematica’s representation focuses on making formulae 
computable. Hence, the content must be encoded, i.e., both the structure and the semantics of 
mathematical formulae must be taken into consideration. 


In LaTeX, the Riemann hypothesis can be expressed using the following string: 


Riemann hypothesis in LaTeX 


\zeta (s) = 0 \Rightarrow \Re s = \frac12 \lor \Im s=0 


In Mathematica, the Riemann hypothesis can be represented as: 


Riemann hypothesis in Mathematica 


Implies[Equal[Zeta[s], 0], Or[Equal[Re[s], Rational[1, 2]], 
Equal[Im[s], 0]]] 


The conversion between these two formats is challenging due to a range of conceptual and 
technical differences. 


First, the grammars underlying the two representation formats differ greatly. LaTeX uses the 
unrestricted grammar of the TeX typesetting system. The entire set of commands can be 
redefined and extended at runtime, which means that TeX effectively allows its users to change 
every character used for the markup, including the \ character typically used to start commands. 
The large degree of freedom of the TeX grammar significantly complicates recognizing even 
the most basic tokens contained in mathematical formulae. In contrast to LaTeX, CAS use a 
significantly more restrictive grammar consisting of a predefined set of keywords and fixed rules 
that govern the structure of expressions. For example, in Mathematica, function arguments 
must always be enclosed in square brackets and separated by commas. 


Second, the extensive differences in the grammars of the two languages are reflected in the 
resulting expression trees. Similar to parse trees in natural language, the syntactic rules of 
mathematical notation, such as operator precedence and function scope, determine a hierarchical 
structure for mathematical expressions that can be understood, represented, and processed as a 
tree. The mathematical expression trees of formulae consist of functions or operators and their 
arguments. We use nested square brackets to denote levels of the tree and superscript Arabic 
numbers to indicate individual tokens in the markup. For the LaTeX representation of the 
Riemann hypothesis, the expression tree is: 


Representation tree of Riemann hypothesis in LaTeX 
\zeta¹ (² s³ )⁴ =⁵ 0⁶ \Rightarrow⁷ \Re⁸ s⁹ =¹⁰ \frac¹¹ [1¹² 2¹³] \lor¹⁴ \Im¹⁵ s¹⁶ =¹⁷ 0¹⁸ 


The tree consists of 18 nodes, i.e., tokens, with a maximum depth of two (for the fraction 
command \frac12). The expression tree of the Mathematica expression consists of 16 tokens 
with a maximum depth of five: 


Representation tree of Riemann hypothesis in Mathematica 
Implies¹⁹ [Equal²⁰ [Zeta²¹ [s²²] 0²³] Or²⁴ [Equal²⁵ [Re²⁶ [s²⁷] Rational²⁸ [1²⁹ 2³⁰]] Equal³¹ [Im³² [s³³] 0³⁴]]] 


The higher complexity of the Mathematica expression reflects that a CAS represents the content 
structure of the formula, which is deeply nested. In contrast, LaTeX exclusively represents the 
presentational layout of the Riemann hypothesis, which is almost linear. 
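

The token counts and depths stated for the two representations can be checked mechanically. The following Python sketch is only an illustration: the (label, children) encoding of the two trees and the helper names are assumptions made here, with top-level tokens counted as depth one and the arguments of a command or function placed one level deeper. 

```python
def leaf(label):
    """A token without children (hypothetical helper)."""
    return (label, [])


def count_and_depth(forest, level=1):
    """Return (number of nodes, maximum depth) for a list of (label, children) nodes."""
    count, depth = 0, 0
    for _label, children in forest:
        sub_count, sub_depth = count_and_depth(children, level + 1)
        count += 1 + sub_count
        depth = max(depth, level, sub_depth)
    return count, depth


# LaTeX: an almost flat token sequence; only \frac owns child tokens.
LATEX_TREE = [
    leaf("\\zeta"), leaf("("), leaf("s"), leaf(")"), leaf("="), leaf("0"),
    leaf("\\Rightarrow"), leaf("\\Re"), leaf("s"), leaf("="),
    ("\\frac", [leaf("1"), leaf("2")]),
    leaf("\\lor"), leaf("\\Im"), leaf("s"), leaf("="), leaf("0"),
]

# Mathematica: every function head owns its arguments as children.
MATHEMATICA_TREE = [
    ("Implies", [
        ("Equal", [("Zeta", [leaf("s")]), leaf("0")]),
        ("Or", [
            ("Equal", [("Re", [leaf("s")]), ("Rational", [leaf("1"), leaf("2")])]),
            ("Equal", [("Im", [leaf("s")]), leaf("0")]),
        ]),
    ]),
]

print(count_and_depth(LATEX_TREE))        # (18, 2)
print(count_and_depth(MATHEMATICA_TREE))  # (16, 5)
```

Under this encoding, the presentational markup stays flat because LaTeX tokens carry no argument structure apart from \frac, whereas each Mathematica head adds one level of nesting. 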


For the given example of the Riemann hypothesis, finding alignments between the tokens 
in both representations and converting one representation into the other is possible. In fact, 
Mathematica and other CAS offer a direct import of TeX expressions, which we evaluate in 
Section 2.3.3. 


However, aside from technical obstacles, such as reliably determining tokens in TeX expressions, 
conceptual differences also prevent a successful conversion between presentation languages, 
such as TeX, and content languages. Even if there were only one generally accepted presentation 
language, e.g., a standardized TeX dialect, and only one generally accepted content language, 
e.g., a standardized input language for CAS, an accurate conversion between the representation 
formats could not be guaranteed. 


The reason is that neither the presentation language, nor the content language always provides 
all required information to convert an expression to the respective language. This can be 
illustrated by the simple expression: F(a + b) = Fa + Fb. The inherent content ambiguity of 
F prevents a deterministic conversion from the presentation language to a content language. F 
might, for example, represent a number, a matrix, a linear function or even a symbol. Without 
additional information, a correct conversion to a content language is not guaranteed. On the 
other hand, the transformation from content language to presentation language often depends 
on the preferences of the author and the context. For example, authors sometimes change the 
presentation of a formula to focus on specific parts of the formula or improve its readability. 


Another obstacle to conversions between typical presentation languages and typical content 
languages, such as the formats of CAS, is the restricted set of functions and the simpler 
grammars that CAS offer. While TeX allows users to express the presentation of virtually 
all mathematical symbols, thus denoting any mathematical concept, CAS do not support all 
available mathematical functions or structures. A significant problem related to the discrepancy 
between the space of concepts expressible using presentation markup and the implementation of 
such concepts in CAS is branch cuts. Branch cuts are restrictions of the set of output values 
that CAS impose on functions that yield ambiguous, i.e., multiple mathematically permissible, 
outputs. One example is the complex logarithm [98, (4.2.1)], which has an infinite set of 
permissible outputs resulting from the periodicity of its inverse function. To account for this 
circumstance, CAS typically restrict the set of permissible outputs by cutting the complex 
plane of permissible outputs. However, since the method of restricting the set of permissible 
outputs varies between systems, identical inputs can lead to drastically different results [3]. 
Accordingly, multiple scientific publications address the problem of accounting for branch cuts 
when entering expressions in CAS, such as [109] for Maple. 


Our review of obstacles to the conversion of representation formats for mathematical formulae 
highlights the need to store both presentation and content information to allow for reversible 
transformations. Mathematical representation formats that include presentation and content 
information can enable the reliable exchange of information between typesetting systems and 
CAS. 


MathML offers standardized markup functionality for both presentation and content 
information. Moreover, the declarative MathML XML format is relatively easy to parse and allows for 
cross references between Presentation Language (PL) and Content Language (CL) elements. 
Listing 2.3 represents excerpts of the MathML markup for our example of the Riemann 
hypothesis (2.2). In this excerpt, the PL token 7 corresponds to the CL token 19, PL token 5 corresponds 
to CL token 20, and so forth. 


Riemann hypothesis in MathML 


<math><semantics><mrow>... 
<mo id="5" xref="20">=</mo> 
<mn id="5" xref="21">0</mn> 
<mo id="7" xref="19">⇒</mo>...</mrow> 
<annotation-xml encoding="MathML-Content"> 
<apply><implies id="19" xref="7"/> 
<apply><eq id="20" xref="5"/>... 
<apply><csymbol id="21" xref="1" cd="wikidata">Q187235</csymbol>... 
</annotation-xml></semantics></math> 


Listing 2.3: MathML representation of the Riemann hypothesis (2.2) (excerpt). 
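

The cross references between PL and CL elements can be resolved with any XML parser. The following Python sketch is a minimal illustration using the standard library. The MathML snippet is a shortened, well-formed stand-in for the excerpt (only tokens 5, 7, 19, and 20 are kept), and the function name cross_references is an assumption made for this example. 

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the parallel markup excerpt (assumption).
MATHML = """<math><semantics>
  <mrow>
    <mo id="5" xref="20">=</mo>
    <mo id="7" xref="19">&#x21D2;</mo>
  </mrow>
  <annotation-xml encoding="MathML-Content">
    <apply><implies id="19" xref="7"/>
      <apply><eq id="20" xref="5"/></apply>
    </apply>
  </annotation-xml>
</semantics></math>"""


def cross_references(xml_text):
    """Map each element id to the pair (own tag, tag of the element its xref points to)."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    pairs = {}
    for el_id, el in by_id.items():
        target = by_id.get(el.get("xref"))
        if target is not None:  # skip dangling references
            pairs[el_id] = (el.tag, target.tag)
    return pairs


refs = cross_references(MATHML)
print(refs["7"])   # ('mo', 'implies')
print(refs["20"])  # ('eq', 'mo')
```

Because the references are stored in both directions, a tool can move from a rendered symbol to its meaning and back without re-parsing the formula. 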


Combined presentation and content formats, such as MathML, significantly improve the access 
to mathematical knowledge for users of digital libraries. For example, including content 
information of formulae can advance search and recommendation systems for mathematical content. 
The quality of these mathematical information retrieval systems crucially depends on the 
accuracy of the computed document-query and document-document similarities. Considering the 
content information of mathematical formulae can improve these computations by: 


1. enabling the consideration of mathematical equivalence as a similarity feature. Instead 
of exclusively analyzing presentation information as indexed, e.g., by considering the 
overlap in presentational tokens, content information allows modifying the query and 
the indexed information. For example, it would become possible to recognize that the 
expressions a(\frac{b}{c} + \frac{d}{c}) and \frac{a(b+d)}{c} have a distance of zero. 

2. allowing the association of mathematical tokens with mathematical concepts. For 
example, linking identifiers, such as E, m, and c, to energy, mass, and speed of light, could 
enable searching for all formulae that combine all or a subset of the concepts. 

3. enabling the analysis of structural similarity. The availability of content information 
would enable the application of measures, such as derivatives of the tree edit distance, 
to discover structural similarity, e.g., using λ-calculus. This functionality could increase 
the capabilities of math-based plagiarism detection systems when it comes to identifying 
obfuscated instances of reused mathematical formulae [253]. 


Content information could furthermore enable interactive support functions for consumers and 
producers of mathematical content. For example, readers of mathematical documents could be 
offered interactive computations and visualizations of formulae to accelerate the understanding 
of STEM documents. Authors of mathematical documents could benefit from automated editing 
suggestions, such as auto completion, reference suggestion, and sanity checks, e.g., type and 
definiteness checking, similar to the functionality of word processors for natural language texts. 


2.3.1.1 Related Work 


A variety of tools exist to convert between representation formats of mathematical formulae. However, 
to our knowledge, Stamerjohanns et al. [351] presented the only study that evaluated the 
conversion quality of tools. Unfortunately, many of the tools evaluated by Stamerjohanns et 
al. are no longer available or are out of date. Watt presents a strategy to preserve formula semantics 
in TeX to MathML conversions. His approach relies on encoding the semantics in custom TeX 
macros rather than expanding the macros [380]. Padovani discusses the roles of MathML and 
TeX elements for managing large repositories of mathematical knowledge [278]. Nghiem et al. 
used statistical machine translation to convert presentation to content language [271]. However, 
they do not consider the textual context of formulae. We will present detailed descriptions and 
evaluation results for specific conversion approaches in Section 2.3.3. 


Youssef addressed the semantic enrichment of mathematical formulae in presentation language. 
They developed an automated tagger that parses TeX formulae and annotates recognized 
tokens very similarly to Part-of-Speech (POS) taggers for natural language [402]. Their tagger 
currently uses a predefined, context-independent dictionary to identify and annotate formula 
components. Schubotz et al. proposed an approach to semantically enrich formulae by analyzing 
their textual context for the definitions of identifiers [329, 330]. 


With their ‘math in the middle’ approach, Dehaye et al. envision an entirely different approach 
to exchanging machine-readable mathematical expressions. In their vision, independent and 
enclosed virtual research environments use a standardized format for mathematics to avoid 
computations and transfers between different systems [94]. 


For an extensive review of format conversion and retrieval approaches for mathematical 
formulae, refer to [326, Chapter 2]. 


2.3.2 Benchmarking MathML 


This section presents MathMLben, a benchmark dataset for measuring the quality of MathML 
markup of mathematical formulae appearing in a textual context. MathMLben is an 
improvement of the gold standard provided by Schubotz et al. [329]. The dataset considers recent 
discussions of the International Mathematical Knowledge Trust²⁰ working group, in 
particular the idea of a ‘Semantic Capture Language’ [165], which makes the gold standard more 
robust and easily accessible. MathMLben: 


• allows comparisons to prior works; 
• covers a wide range of research areas in STEM literature; 
• provides references to manually annotated and corrected MathML items that are 
compliant with the MathML standard; 
• is easy to modify and extend, i.e., by external collaborators; 
• includes default distance measures; and 
• facilitates the development of converters and tools. 


In Section 2.3.2.1, we present the test collection included in MathMLben. In Section 2.3.2.2, we 
present the encoding guidelines for the human assessors and describe the tools we developed 
to support assessors in creating the gold standard dataset. In Section 2.3.2.3, we describe the 
similarity measures used to assess the markup quality. 


2.3.2.1 Collection 


Our test collection contains 305 formulae (more precisely, mathematical expressions ranging 
from individual symbols to complex multi-line formulae) and the documents in which they 
appear. 


Expressions 1 to 100 correspond to the search targets used for the ‘National Institute of 
Informatics Testbeds and Community for Information access Research Project’ (NTCIR) 11 
Math Wikipedia Task [329]. This list of formulae has been used for formula search and content 
enrichment tasks by at least 7 different research institutions. The formulae were randomly 
sampled from Wikipedia and include expressions with incorrect presentation markup. 


Expressions 101 to 200 are random samples taken from the NIST DLMF [98]. The DLMF 
website contains 9,897 labeled formulae created from semantic LaTeX source files [77, 78]. In 
contrast to the examples from Wikipedia, all these formulae are from the mathematics research 
field and exhibit high-quality presentation markup. The formulae were curated by renowned 
mathematicians, and the editorial board keeps improving the quality of the formulae’s markup²¹. 
Sometimes, a labeled formula contains multiple equations. In such cases, we randomly chose 
one of the equations. 


Expressions 201 to 305 were chosen from the queries of the NTCIR arXiv and NTCIR-12 
Wikipedia datasets. Of these queries, 70% originate from the arXiv [22] and 30% from a Wikipedia 
dump. 


²⁰ http://imkt.org/ [accessed 2021-08-03] 
²¹ http://dlmf.nist.gov/about/staff [accessed 2021-08-03] 


All data is openly available for research purposes and can be obtained from: 
https://mathmlben.wmflabs.org.²² 


2.3.2.2 Gold Standard 


We provide explicit markup with universal, context-independent symbols in content MathML. 
Since the symbols from the default content dictionary of MathML²³ alone were insufficient to 
cover the range of semantics in our collection, we added the Wikidata content dictionary [328]. 
As a result, we could refer to all Wikidata items as symbols in a content tree. This approach has 
several advantages. Descriptions and labels are available in many languages. Some symbols 
even have external identifiers, e.g., from the Wolfram Functions Site or from Stack Exchange 
topics. All symbols are linked to Wikipedia articles, which offer extensive human-readable 
descriptions. Finally, symbols have relations to other Wikidata items, which opens a range of 
new research opportunities, e.g., for improving the taxonomic distance measure [336]. 


Our Wikidata-enhanced yet standard-compliant MathML markup facilitates the manual 
creation of content markup. To further support human assessors in creating content annotations, 
we extended the VMEXT visualization tool [331] to develop a visual support tool for creating 
and editing the MathMLben gold standard. 


Table 2.3: Special content symbols added to LaTeXML for the creation of the gold standard. 


No. | Rendering  | Meaning        | Example IDs 
1   | [x,y]      | commutator     | 91 
2   | x^{y}_{z}  | tensor         | 43, 208, 226 
3   | x^\dagger  | adjoint        | 224, 277 
4   | x'         | transformation | 20 
5   | x^\circ    | degree         | 20 
6   | x^{(dim)}  | contraction    | 225 


For each formula, we saved the source document written in different dialects of LaTeX and 
converted it into content MathML with parallel markup using LaTeXML [135, 257]. LaTeXML is a 
Perl program that converts LaTeX documents to XML and HTML. We chose LaTeXML because 
it is the only tool that supports our semantic macro set. We manually annotated our dataset, 
generated the MathML representation, manually corrected errors in the MathML, and linked 
the identifiers to Wikidata concept entries whenever possible. Alternatively, one could initially 
generate MathML using a CAS and then manually improve the markup. 


Since there is no generally accepted definition of expression trees, we made several design 
decisions to create semantic representations of the formulae in our dataset using MathML trees. 
In some cases, we created new macros to be able to create a MathML tree for our purposes 
using LaTeXML²⁴. Table 2.3 lists the newly created macros. Hereafter, we explain our decisions 
and give examples of formulae in our dataset that were affected by the decisions. 


²² Visit https://mathmlben.wmflabs.org/about for a user guide [accessed 2021-08-03]. 
²³ http://www.openmath.org/cd [accessed 2021-08-03] 
²⁴ http://dlmf.nist.gov/latexml/manual/customization/customization.latexml.html#SS1.SSS0.Px1 
[accessed 2021-08-03] 


• not assign Wikidata items to basic mathematical identifiers and functions like factorial, 
\log, \exp, \times, \pi. Instead, we left these annotations to the DLMF LaTeX macros, 
because they represent the mathematical concept by linking to the definition in the DLMF 
and LaTeXML creates valid and accurate content MathML for these macros [GoldID 3, 11, 
19, ...]; 


• split up indices and labels of elements as child nodes of the element. For example, we 
represent i as a child node of p in p_i [GoldID 29, 36, 43, ...]; 


• create a special macro to represent tensors, such as for T g [GoldID 43], to represent 
upper and lower indices as child nodes (see Table 2.3); 


• create a macro for dimensions of tensor contractions [GoldID 225], e.g., to distinguish 
the three-dimensional contraction of the metric tensor in g^{(3)} from a power function (see 
Table 2.3); 


• choose one subexpression randomly if the original expression contained lists of expressions 
[GoldID 278]; 


• remove equation labels, as they are not part of the formula itself. For example, in 
E = mc^2, (x) 
the (x) is the ignored label; 


• remove operations applied to entire equations, e.g., applying the modulus. In such cases, 
we interpreted the modulus as a constraint of the equation [GoldID 177]; 


• use additional macros (see Table 2.3) to interpret complex conjugations, transformation 
signs, and degree symbols as functional operations (the identifier is a child node of the 
operation symbol), e.g., * or \dagger for complex conjugations [GoldID 224, 277], S' for 
transformations [GoldID 20], 30^\circ for thirty degrees [GoldID 30]; 


• for formulae with multiple cases, render each case as a separate branch [GoldID 49]; 


• render variables that are part of separate branches in bracket notation. We implemented 
the Dirac bracket commutator [] (omitting the index _\text{DB}) and an anticommutator 
by defining new macros (see Table 2.3). Thus, there is a distinction between a (ring) 
commutator [a,b] = ab - ba and an anticommutator {a,b} = ab + ba, without 
further annotation of Dirac or Poisson brackets [GoldID 91]; 


• use the command \operatorname{} for multi-character identifiers or operators [GoldID 
22]. This markup is necessary because most LaTeX parsers, including LaTeXML, interpret 
multi-character expressions as multiplications of the characters. In general, this 
interpretation is correct, since it is inconvenient to use multi-character identifiers [54]. 


Some of these design decisions are debatable. For example, introducing a new macro, such as 
\identifiername{}, to distinguish between multi-character identifiers and operators might 
be advantageous to our approach. However, introducing many highly specialized macros is 
excessive and likely not a viable approach. A borderline example in regard to this 
problem is Δx [GoldID 280]. Formulae of this form could be annotated as \operatorname{}, 
\identifiername{} or more generally as \expressionname{}. We interpret Δ as a 
difference applied to a variable, and render the expression as a function call. 


Figure 2.4: Graphical User Interface (GUI) to support the creation of our gold standard. The 
interface provides several TeX input fields (left) and a mathematical expression tree rendered 
by the VMEXT visualization tool (right). 


Bracket notations are a similar case in which the dataset could be overloaded with highly 
specialized macros. For example, the bracket (Dirac) notation, e.g., [GoldID 209], is mainly used 
in quantum physics. The angle brackets for the Dirac notation, ⟨ and ⟩, and a vertical bar | are 
already interpreted correctly as "latexml - quantum-operator-product". However, a more precise 
distinction between a twofold scalar product, e.g., ⟨a|b⟩, and a threefold expectation value, e.g., 
⟨a|A|a⟩, might become necessary in some scenarios to distinguish between matrix elements and a 
scalar product. 


We developed a Web application to create and cultivate the gold standard entries, which is 
available at: https://mathmlben.wmflabs.org/. The GUI provides the following 
information for each Gold ID entry. 


• Formula Name: the name of the formula (optional) 

• Formula Type: either definition, equation, relation, or General Formula (if none of the 
previous names fit) 

• Original Input TeX: the LaTeX expression extracted from the source 

• Corrected TeX: the manually corrected LaTeX expression 

• Hyperlink: the hyperlink to the position of the formula in the source 

• Semantic LaTeX Input: the manually created semantic version of the corrected LaTeX 
field. This entry is used to generate our MathML with Wikidata annotations. 

• Preview of Corrected LaTeX: a preview of the corrected LaTeX input field rendered as 
an SVG image in real time using Mathoid [335], a service to generate SVGs and MathML 
from LaTeX input. It is shown in the top right corner of the GUI. 

• VMEXT Preview: the VMEXT field renders the expression tree based on the content 
MathML. The symbol in each node is associated with the symbol in the cross-referenced 
presentation MathML. 


Figure 2.4 shows the GUI, which allows users to manually modify the different formats of a formula. 
While the other fields are intended to provide additional information, the pipeline to create and 
cultivate a gold standard entry starts with the semantic LaTeX input field. LaTeXML generates 
content MathML based on this input, and VMEXT then renders the generated content MathML. 
We control the output by using the DLMF LaTeX macros [260] and our developed 
extensions. The following list contains some examples of the DLMF LaTeX macros. 


• \EulerGamma@{z}: \Gamma(z): gamma function, 
• \BesselJ{\nu}@{z}: J_{\nu}(z): Bessel function of the first kind, 
• \LegendreQ[\mu]{\nu}@{z}: Q_{\nu}^{\mu}(z): associated Legendre function of the second kind, 
• \JacobiP{\alpha}{\beta}{n}@{x}: P_{n}^{(\alpha,\beta)}(x): Jacobi polynomial. 


The DLMF web pages, which we use as one of the sources for our dataset, were generated 
from semantically enriched LaTeX sources using LaTeXML. Since LaTeXML is capable of interpreting 
semantic macros, generates content MathML that can be controlled with macros, and is easily 
extensible by new macros, we also used LaTeXML to generate our gold standard. While the DLMF 
is a compendium for special functions, we need to annotate every identifier in the formula with 
semantic information. Therefore, we extended the set of semantic macros. 


In addition to the special symbols listed in Table 2.3, we created macros to semantically enrich 
identifiers, operators, and other mathematical concepts by linking them to their Wikidata items. 
As shown in Figure 2.4, the annotations are visualized using yellow info boxes appearing on 
mouse over. The boxes show the Wikidata QID, the name, and the description (if available) of 
the linked concept. 


Aside from naming, classifying, and semantically annotating each formula, we performed three 
other tasks: 


• correcting the LaTeX string extracted from the sources; 
• checking and correcting the MathML generated by LaTeXML; and 
• visualizing the MathML using VMEXT. 


Most of the extracted formulae contained constructs to improve human readability of the source 
code, such as commented line breaks, %\n, in long mathematical expressions, or special macros 
to improve the displayed version of the formula, e.g., spacing macros, delimiters, and scale 
settings, such as \!, \, or \>. Since they are part of the expression, all of the tested tools 
(including LaTeXML) try to include these formatting improvements in the MathML markup. For our 
gold standard, we focus on the pure semantic information and forgo formatting improvements 
related to displaying the formula. The corrected TeX field shows the cleaned mathematical LaTeX 
expression. 
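

The kind of cleanup described here can be approximated with a few regular expressions. The following Python sketch is an illustration only and not the procedure used to build the gold standard, which was corrected manually. The function name clean_tex is hypothetical, and the sketch covers just the commented line breaks and the three spacing macros named above. 

```python
import re

# A '%' that is not escaped as '\%', the rest of that line, the line break,
# and any indentation of the following line.
COMMENT_BREAK = re.compile(r"(?<!\\)%[^\n]*\n\s*")
# The spacing macros \! \, and \> (not preceded by another backslash).
SPACING_MACRO = re.compile(r"(?<!\\)\\[!,>]")


def clean_tex(tex):
    """Remove commented line breaks and spacing macros, then normalize whitespace."""
    tex = COMMENT_BREAK.sub("", tex)
    tex = SPACING_MACRO.sub(" ", tex)  # a space keeps neighbouring tokens apart
    return re.sub(r"\s+", " ", tex).strip()


print(clean_tex("a\\,b%\n  +\\!c"))  # a b+ c
```

Delimiter sizing and scale settings would need additional rules, and a production cleanup would have to work on TeX tokens rather than on raw characters. 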


Using the corrected TeX field and the semantic macros, we were able to adjust the MathML 
output using LaTeXML and verify it by checking the visualization from VMEXT. 


2.3.2.3 Evaluation Metrics 


To quantify the conversion quality of individual tools, we computed the similarity of each 
tool’s output and the manually created gold standard. To define the similarity measures for 
this comparison, we built upon our previous work [336], in which we defined and evaluated 
four similarity measures: taxonomic distance, data type hierarchy level, match depth, and 
query coverage. The measures taxonomic distance and data type hierarchy level require the 
availability of a hierarchical ordering of mathematical functions and objects. For our use case, 
we derived this hierarchical ordering from the MathML content dictionary. The measures assign 
a higher similarity score if matching formula elements belong to the same taxonomic class. 
The match depth measure operates under the assumption that matching elements, which are 
more deeply nested in a formula’s content tree, i.e., farther away from the root node, are less 
significant for the overall similarity of the formula, hence are assigned a lower weight. The 
query coverage measure performs a simple ‘bag of tokens’ comparison between two formulae 
and assigns a higher score the more tokens the two formulae share. 
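

To make the ‘bag of tokens’ idea concrete, the following Python sketch computes the share of query tokens that also occur in a second formula. This is only one plausible reading of the query coverage measure. The exact definition is given in [336], and the function name and the normalization by the query size are assumptions made for this example. 

```python
from collections import Counter


def query_coverage(query_tokens, candidate_tokens):
    """Fraction of query tokens (counted with multiplicity) found in the candidate."""
    query, candidate = Counter(query_tokens), Counter(candidate_tokens)
    if not query:
        return 0.0
    shared = sum((query & candidate).values())  # multiset intersection
    return shared / sum(query.values())


# All five query tokens occur in the longer candidate formula.
print(query_coverage(["E", "=", "m", "c", "2"], ["E", "=", "m", "c", "2", "+", "x"]))  # 1.0
# Only one of the two query tokens is shared.
print(query_coverage(["a", "b"], ["a", "c"]))  # 0.5
```

Such a comparison ignores the tree structure entirely, which is why it is combined with structure-aware measures such as match depth. 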


In addition to these similarity measures, we also included the tree edit distance. For this purpose, 
we adapted the robust tree edit distance (RTED) implementation for Java [288]. We modified 
RTED to accept any valid XML input and added math-specific ‘shortcuts’, i.e., rewrite rules that 
generate lower distance scores than arbitrary rewrites. For example, rewriting \frac{a}{b} to ab^{-1} causes 
a significant difference in the expression tree: three nodes (^, -, 1) are inserted and one node 
is renamed (÷ → ·). The ‘costs’ for performing these edits using the stock implementation of 
RTED are c = 3i + r, where i denotes the cost of an insertion and r the cost of a renaming. 
However, the actual difference is an equivalence, which we think should 
be assigned a cost of e < 3i + r. We set e < r < i. 
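

The cost of the fraction example can be reproduced with a small tree edit distance routine. The following Python sketch is not RTED and does not include the math-specific shortcuts. It is a plain memoized version of the standard ordered tree edit distance recurrence, and the node helper, the tree encoding, and the unit costs are assumptions chosen for illustration. 

```python
from functools import lru_cache


def node(label, *children):
    """Hashable tree node: (label, tuple of child nodes)."""
    return (label, tuple(children))


def tree_edit_distance(t1, t2, ins=1, dele=1, ren=1):
    """Ordered tree edit distance via the rightmost-root forest recurrence."""

    @lru_cache(maxsize=None)
    def forest_dist(f1, f2):
        if not f1 and not f2:
            return 0
        if not f1:  # only insertions remain
            _, kids = f2[-1]
            return forest_dist(f1, f2[:-1] + kids) + ins
        if not f2:  # only deletions remain
            _, kids = f1[-1]
            return forest_dist(f1[:-1] + kids, f2) + dele
        (l1, k1), (l2, k2) = f1[-1], f2[-1]
        return min(
            forest_dist(f1[:-1] + k1, f2) + dele,  # delete rightmost root of f1
            forest_dist(f1, f2[:-1] + k2) + ins,   # insert rightmost root of f2
            forest_dist(f1[:-1], f2[:-1]) + forest_dist(k1, k2)
            + (0 if l1 == l2 else ren),            # match the two roots
        )

    return forest_dist((t1,), (t2,))


fraction = node("divide", node("a"), node("b"))                       # a / b
product = node("times", node("a"),
               node("^", node("b"), node("-", node("1"))))            # a * b^(-1)

print(tree_edit_distance(fraction, product, ins=1, dele=1, ren=1))  # 4, i.e., 3i + r
print(tree_edit_distance(fraction, product, ins=1, dele=1, ren=0))  # 3
```

With unit costs, the two equivalent expressions are four edits apart, which illustrates why a dedicated rewrite rule with a lower cost e is useful. 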


2.3.3 Evaluation of Context-Agnostic Conversion Tools 


This section presents the results of evaluating existing, context-agnostic conversion tools for 
mathematical formulae using our benchmark dataset MathMLben (cf. Section 2.3.2). We compare 
the distances between the presentation MathML and the content MathML tree of a formula 
yielded by each tool to the respective trees of formulae in the gold standard. We use the 
tree edit distance with customized weights and math-specific shortcuts. The shortcuts aim to 
eliminate degrees of freedom that are inherent to the notation, e.g., additional PL elements or 
layout blocks, such as mrow or mfenced. 


2.3.3.1 Tool Selection 


We compiled a list of available conversion tools from the W3C²⁵ wiki, from GitHub, and from 
questions about automated conversion of mathematical LaTeX to MathML on Stack Overflow. 
We selected the following converters: 


²⁵ https://www.w3.org/wiki/Math_Tools [accessed 2021-08-03] 




• LaTeXML: can convert generic and semantically annotated LaTeX expressions to 
XML/HTML/MathML. The tool is written in Perl [257] and is actively maintained. LaTeXML was 
specifically developed to generate the DLMF web page and can therefore parse entire TeX 
documents. LaTeXML also supports conversions to content MathML. 


• LaTeX2MathML: is a small Python project and is able to generate presentation MathML 
from generic LaTeX expressions [245]. 


• Mathoid: is a service developed using Node.js, PhantomJS and MathJax (a JavaScript 
display engine for mathematics) to generate SVGs and MathML from LaTeX input. Mathoid 
is currently used to render mathematical formulae on Wikipedia [335]. 


• SnuggleTeX: is an open-source Java library developed at the University of 
Edinburgh [251]. The tool allows converting simple LaTeX expressions to XHTML and 
presentation MathML. 


• MathToWeb: is an open-source Java-based web application that generates presentation 
MathML from LaTeX expressions²⁶. 


• TeXZilla: is a JavaScript web application for LaTeX to MathML conversion capable of 
handling Unicode characters²⁷. 


• Mathematical: is an application written in C and wrapped in Ruby to provide a fast 
translation from LaTeX expressions to the image formats SVG and PNG. The tool also 
provides translations to presentation MathML²⁸. 


• CAS: we included Mathematica, which is capable of parsing LaTeX expressions. 


• Part-of-Math (POM) Tagger: is a grammar-based LaTeX parser that tags recognized tokens 
with information from a dictionary [402]. The POM tagger is currently under 
development. In this paper, we use the first version. In [3], this version was used to provide 
translations from LaTeX to the CAS Maple. In its current state, the program offers no export to 
MathML. We developed an XML exporter to be able to compare the tree provided by the 
POM tagger with the MathML trees in the gold standard. 


2.3.3.2 Testing framework 


We developed a Java-based framework that calls the programs to parse the corrected TeX input 
data from the gold standard to presentation MathML and, if applicable, to content MathML. In 
the case of the POM tagger, we parsed the input string to a general XML document. We used the 
corrected TeX input format instead of the originally extracted string expressions (see Section 2.3.2.2). 


Executing the testing framework requires the manual installation of the tested tools. The POM 
tagger is not yet publicly available. 


2.3.3.3 Results 


Figure 2.5 shows the averaged structural tree edit distances between the presentation trees
(blue) and content trees (orange) of the generated MathML files and the gold standard. To
calculate the structural tree edit distances, we used the RTED [288] algorithm with costs of
i = 1 for inserting, d = 1 for deleting, and r = 0 for renaming nodes. Furthermore, the figure
shows the total number of successful transformations for the 305 expressions (black ticks).
Note that we also consider differences of the presentation tree to the gold standard as deficits,
because the mapping from LaTeX expressions to rendered expressions is unique (as long as the
same preambles are used). A larger distance indicates that more elements of an expression were
misinterpreted by the parser. However, certain differences between presentation trees might be
tolerable, e.g., reordering commutative expressions, while differences between content trees are
more critical. Also note that improving content trees does not necessarily improve presentation
trees and vice versa. In case of f(x + y), the content tree changes depending on whether f
represents a variable or a function, while the presentation tree is identical in both cases. In
contrast, notations such as $\frac{a}{b}$ and $a/b$ have different presentation trees, while the
content trees are identical.


MathToWeb: https://www.mathtowebonline.com [accessed 2021-08-03]
TeXZilla: https://fred-wang.github.io/TeXZilla [accessed 2021-08-03]
Mathematical: https://github.com/gjtorikian/mathematical [accessed 2021-08-03]


Figure 2.5: Overview of the structural tree edit distances (using r = 0, i = d = 1) between the
MathML trees generated by the conversion tools and the gold standard MathML trees.
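
The cost configuration above (insert and delete cost 1, rename cost 0) can be illustrated with a small sketch. Note that this is not RTED: it is a naive memoized forest recursion for ordered trees, chosen only because it fits in a few lines, and the tuple-based tree encoding and all function names are assumptions made for this example.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Costs follow the text: i = d = 1, r = 0.
INSERT_COST = 1
DELETE_COST = 1
RENAME_COST = 0


def size(forest):
    """Count all nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Edit distance between two ordered forests (naive recursion, not RTED)."""
    if not f1 and not f2:
        return 0
    if not f1:
        return INSERT_COST * size(f2)
    if not f2:
        return DELETE_COST * size(f1)
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    rename = 0 if label1 == label2 else RENAME_COST
    return min(
        # delete the rightmost root of f1; its children move up into the forest
        forest_distance(f1[:-1] + kids1, f2) + DELETE_COST,
        # insert the rightmost root of f2
        forest_distance(f1, f2[:-1] + kids2) + INSERT_COST,
        # map the two rightmost roots onto each other
        forest_distance(kids1, kids2) + forest_distance(f1[:-1], f2[:-1]) + rename,
    )


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


# Simplified presentation trees (element names only) for a/b and \frac{a}{b}.
slash = ("mrow", (("mi", ()), ("mo", ()), ("mi", ())))
frac = ("mfrac", (("mi", ()), ("mi", ())))
```

With a rename cost of zero, the measure only reacts to structural differences: the two trees above differ by the single operator node, whereas two trees of the same shape with different labels have distance zero.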


Figure 2.6 illustrates the runtime performance of the tools. We excluded the CAS from the
runtime performance tests, because the system is not primarily intended for parsing LaTeX ex-
pressions, but for performing complex computations. Therefore, runtime comparisons between
a CAS and conversion tools would not be representative. We measured the times required to
transform all 305 expressions in the gold standard and write the transformed MathML to the
storage cache. Note that the native code of LaTeX2MathML, Mathematical, and LaTeXML was
called from the Java Virtual Machine (JVM) and Mathoid was called through local web-requests,
which increased the runtime of these tools. The figure is scaled logarithmically. We would
like to emphasize that LaTeXML is designed to translate sets of LaTeX documents instead of single
mathematical expressions. Most of the other tools are lightweight engines.

Figure 2.6: Time in seconds required by each tool to parse the 305 gold standard LaTeX expressions
in logarithmic scale.


In this benchmark, we focused on the structural tree distances rather than on distances in 
semantics. While our gold standard provides the information necessary to compare the extracted 
semantic information, we will focus on this problem in future work. 


2.3.4 Summary of MathML Converters 


We make available the first benchmark dataset to evaluate the conversion of mathematical 
formulae between presentation and content formats. During the encoding process for our 
MathML-based gold standard, we presented the conceptual and technical issues that conversion 
tools for this task must address. Using the newly created benchmark dataset, we evaluated 
popular context-agnostic LaTeX-to-MathML converters. We found that many converters simply
do not support the conversion from presentation to content format, and those that did often 
yielded mathematically incorrect content representations even for basic input data. These 
results underscore the need for future research on mathematical format conversions. 


Of the tools we tested, LaTeXML yielded the best conversion results, was easy to configure,
and was highly extensible. However, these benefits come at the price of a slow conversion speed.
Due to its comparably low error rate, we chose to extend the LaTeXML output with semantic
enhancements.


2.4 Mathematical Information Retrieval for LaTeX Translations 


In the following, we will briefly discuss related work in the Mathematical Information Retrieval
(MathIR) arena in order to find existing practical approaches for a translation from presen-
tational to computable formats. MathIR is the research area that aims to retrieve additional
(generally semantic) information about mathematical content [141]. In turn, the task of trans-
lating mathematical presentational formats to computable formats is part of this research area
since it requires a context-dependent semantification, i.e., the semantic enhancement or en-
richment of mathematical objects with additional information. One of the most well-studied
tasks in MathIR is searching for relevant mathematical expressions or content [21, 22, 241, 346,
405, 408]. However, successful solutions in this area focus on similarity measures and do not
necessarily require a deep understanding of the meaning and content of a formula. Likewise,
other tasks in MathIR, such as entity linking, use similarity measures to retrieve connections
between entities rather than semantic relatedness [208, 319, 321]. Thus, much of the related work in
MathIR is not particularly beneficial for translating presentational encodings to computable
formats. One of the reasons for this research gap is presumably a semantic version of the
chicken-or-the-egg causality dilemma. On the one hand, semantically enriching mathematical objects in
an expression requires identifying the meaningful objects. On the other hand, identifying those
meaningful objects requires semantic information about those objects. In other words, if we
want to annotate $P_n^{(\alpha,\beta)}(x)$ with Jacobi polynomial in our use case equation (1.1),
we need to know that $P_n^{(\alpha,\beta)}(x)$ refers to the Jacobi polynomial.


Figure 2.7 illustrates this issue by splitting a math expression into four layers of mathematical 
objects. The identifier layer contains all identifiers (which may include general symbols and 
numbers too). The arithmetic layer contains arithmetic structures that combine tokens from 
the identifier layer to mathematical terms. This layer may include logic terms, sets, and other 
mathematical concepts with specific notations. The function layer combines elements from the 
lower layers to entire function calls. The top expression layer contains entire expressions in 
documents, which are often a composition of elements in the previous layers. Elements in the
function layer and the arithmetic layer differ in the ambiguity of their notations. Elements in
the arithmetic layer generally do not need to be mapped to specific keywords in CAS because
they are often semantically unique. In contrast, elements in the function layer are potentially
ambiguous. However, a clear distinction between both layers is not always necessary and
may even be confusing in other MathIR-related scenarios. For our task, the distinction is beneficial
because elements in the function layer must be mapped to specific keywords in the CAS syntax,
while elements in the arithmetic layer can be mostly ignored.
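
The practical consequence of this distinction can be sketched in a few lines. In the example below, identifiers and arithmetic operators pass through unchanged, while function-layer elements are looked up in a mapping table. The table, the semantic keys, and the nested-tuple encoding are hypothetical and only serve the illustration; the procedure names `JacobiP` and `GAMMA`/`Gamma` are the ones Maple and Mathematica use for these functions.

```python
# Hypothetical mapping from function-layer elements to CAS procedure templates.
FUNCTION_LAYER = {
    "jacobi_polynomial": {
        "Maple": "JacobiP({0}, {1}, {2}, {3})",
        "Mathematica": "JacobiP[{0}, {1}, {2}, {3}]",
    },
    "gamma_function": {
        "Maple": "GAMMA({0})",
        "Mathematica": "Gamma[{0}]",
    },
}

ARITHMETIC_LAYER = {"+", "-", "*", "/"}


def translate(node, cas):
    """Translate a nested-tuple expression (head, arg1, arg2, ...) to a CAS string."""
    if isinstance(node, str):
        # identifier layer: no mapping rule required
        return node
    head, *args = node
    rendered = [translate(arg, cas) for arg in args]
    if head in ARITHMETIC_LAYER:
        # arithmetic layer: natively understood, passed through as infix
        return "(" + f" {head} ".join(rendered) + ")"
    # function layer: requires an explicit, CAS-specific mapping
    return FUNCTION_LAYER[head][cas].format(*rendered)


expression = ("jacobi_polynomial", "n", "alpha", "beta", ("+", "x", "1"))
maple = translate(expression, "Maple")
mathematica = translate(expression, "Mathematica")
```

Only the function-layer branch depends on the target CAS, which is why the disambiguation effort concentrates there.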


Existing MathIR tasks focus on semantically enhancing either the expression [208, 209, 215], 
arithmetic [93, 242, 339], or the identifier [121, 279, 329, 330, 339, 400] layer, missing the 
important function layer entirely. An algorithm needs to understand the involved functions to 
identify objects in the function layer. This dilemma is usually avoided in MathIR tasks since 
objects in the other layers can be extracted primarily context-independently. The meaning of 
arithmetic operators usually does not change (e.g., +, -, or /) and math identifiers can often be
presumed to be Latin or Greek letters. The function layer, however, contains the most crucial
objects for the translation task. Identifiers generally represent mutable objects, such as variables
or parameters, and do not require specific mapping rules. Similarly, arithmetic operations are
natively supported by most mathematical software. Finally, objects in the expression layer are
often too abstract (because they are compositions of multiple objects) and cannot be mapped
as a whole to a single logic procedure in a computable format.


There are approaches available that try to semantically enrich elements in the function layer. 
However, most of these semantic enrichment approaches focus solely on mathematical ex- 
pressions themselves and do not analyze textual information [159, 259, 270, 339, 364, 374]. 


Semantification is also often called semantic enrichment.
For an extensive review of retrieval approaches for mathematical formulae, see also [326, Chapter 2].

Figure 2.7: Four different layers of math objects in a single mathematical expression. The red
highlights in the function and arithmetic layer refer to the fixed structure (or stem) of the
function or operator. Gray tokens are mutable. Elements in the arithmetic layer are generally
understood without further mappings and are mostly context-independent while elements in
the function layer must be mapped to specific procedures in CAS and require disambiguation.
However, a strict distinction is not always required and might even be confusing. For example,
n! is mostly understood by CAS and context-independent but can (and sometimes should) be
mapped to the specific factorial procedure, making it closer to an element of the function layer.


Approaches that take the textual context of a formula into account, on the other hand, do not 
semantically enrich objects in the function layer. Instead, they focus on other specific appli- 
cations, including math embeddings with the goal of a semantic vector representation [121, 
215, 360, 400, 404], entity linking [208, 212, 316, 321], math word problem solving [285, 409], 
semantic annotation [183, 214, 279, 329, 330], and context-aware math search engines [93, 122, 
124, 145, 210, 211, 232, 273, 314, 315, 366]. Literature on translating mathematical expressions from
a lower level of semantics to a higher level is limited. The most relevant
work for our task includes semantic tagging [71, 402], annotations [139, 183, 214, 279,
329, 330], and term disambiguation [339]. In the following, we distinguish semantic tagging
(the task of precisely tagging math objects with a pre-defined set of semantic tags) and semantic 
annotation (the task of adding any number of relevant descriptions to math objects). 


Semantic Tagging and Term Disambiguation Semantic tagging of mathematical tokens 
has rarely been studied in the past and has not reached a well-established reliability level yet. To 
the best of our knowledge, only Chien et al. [71] (2015) and Youssef [402] (2017) addressed the 
issue for semantic tokenization of math formulae. Youssef [402] created the POM tagger, which 
tags tokens in the KIEX parse tree with additional information from a manually crafted lexicon. 
The POM tagger is still a work in progress and does not perform disambiguation steps yet. In the 
future, it is planned to reduce the number of possible tags for a token by analyzing the textual 
context and eliminating false tags. Ideally, the extracted context information results in a single, 
unique tag for each token. However, no update of the POM tagger, including the disambiguation 
steps, has been published so far. Recently, however, Shan and Youssef [339] presented several 
machine learning approaches as the first step towards disambiguation of mathematical terms. 
They trained different models on the semantic DLMF dataset and successfully disambiguated 
prime notations with an F1 score of 0.83. However, whether the models merely adapted to the relatively
strict DLMF notation style for primes or can also disambiguate other real-world
data has not been discussed.


Chien et al. [71] proposed a probabilistic model on entire document collections to conclude 
semantic tags of mathematical tokens. They focused on tagging single identifiers (i.e., no groups 
of tokens). They concluded that the consistency property and user habits are critical aspects for
successful tag disambiguation. With user habits, the authors referred to the different education 
levels and expertise of users so that a model can predict the preferred notation for specific 
semantics. The consistency property refers to the assumption that the meaning of a single 
term does not change within a certain context, e.g., a document. Recent efforts on annotating 
mathematical symbols by Asakura et al. [1], however, indicate that the scope of consistent tags 
could be significantly smaller than an entire document or a document collection. The semantics 
of frequently used symbols, such as x or t, may even change within single paragraphs. Another
interesting counterexample is the connection between Euler numbers and Euler polynomials [98,
(24.2.9)] in

$$E_n = 2^n E_n\left(\tfrac{1}{2}\right). \tag{2.3}$$

While clearly connected, the first E refers to the Euler number but the second E refers to Euler
polynomials. This underlines that under special circumstances, even within the scope of a single 
equation, an identifier may refer to two different mathematical concepts. Chien et al. reported 
a maximum accuracy of 0.94. 
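
As a quick sanity check of equation (2.3), the case n = 2 can be worked out by hand. The sketch below relies on the standard closed form of the second Euler polynomial and on the value of the second Euler number, neither of which is stated in the text above.

```latex
\begin{align*}
E_2(x) &= x^2 - x, \\
E_2\!\left(\tfrac{1}{2}\right) &= \tfrac{1}{4} - \tfrac{1}{2} = -\tfrac{1}{4}, \\
2^2 \, E_2\!\left(\tfrac{1}{2}\right) &= -1 = E_2 .
\end{align*}
```

The symbol on the far right is the Euler number, while the one evaluated at 1/2 is the Euler polynomial, so the same letter carries two meanings within a single line.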


Semantic Annotation Task While the task of semantic annotation has been studied more 
comprehensively, none of these existing approaches tried to convert the source expressions 
into a computable format [139, 183, 214, 279, 329, 330]. Grigore et al. [139], Nghiem et al. [269], 
Pagel et al. [279], Schubotz et al. [329, 330], and Kristianto et al. [214] analyze nouns or noun 
phrases in the surrounding context of a formula to semantically annotate an entire expression 
or parts of an expression. Only Grigore et al. [139] tried to use this information to perform a 
translation to a semantically enhanced format, here content MathML. The authors deduced a 
CD entry for a math symbol by calculating the similarity of the nouns surrounding the symbol 
and the textual description (or more precisely: the cluster of nouns in that description) of the 
CD entry. They measured the similarity with distributional properties from WordNet [261]. The 
other approaches either use the gained semantic information to improve search engines [214, 
269] or enable entity linking [279, 329, 330]. While other semantification approaches exist that 
elevate source presentational formats to a semantically enriched format [245, 251, 257, 270, 271, 
364, 391], none of them take the textual context into account. Some of them, however, perform 
disambiguation steps by considering other mathematical expressions in the same document 
(again presuming a semantic consistency of math notation within a single document as proposed 
by Chien et al. [71]) [270, 271]. None of the previous work considered the possibility of an 
identifier that has multiple meanings within a single formula, as shown in equation (2.3). 
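
The idea of selecting a CD entry by comparing context nouns with description nouns can be sketched as follows. This is not the method of Grigore et al.: it swaps their WordNet-based distributional similarity for a plain Jaccard overlap, and the two CD entries and their noun lists are invented for the illustration.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of nouns."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Hypothetical content dictionary (CD) entries with nouns from their descriptions.
CD_ENTRIES = {
    "gamma_function": ["gamma", "function", "factorial", "integral"],
    "euler_constant": ["euler", "constant", "limit", "series"],
}


def best_cd_entry(context_nouns, entries=CD_ENTRIES):
    """Return the CD entry whose description nouns overlap most with the context."""
    return max(entries, key=lambda name: jaccard(context_nouns, entries[name]))


# Nouns assumed to surround the symbol in the running text.
choice = best_cd_entry(["gamma", "function", "argument"])
```

A lexical overlap like this misses synonyms, which is presumably why a distributional measure is preferable in practice.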


Summary In summary, semantic enrichment approaches avoid the essential function layer [159,
259, 270, 364, 374], ignore the textual context surrounding a formula [71, 245, 251, 257, 270, 271,
296, 364, 391], or do not use the extracted information for a translation towards a semantically
enhanced format [183, 214, 279, 329, 330, 402]. Nonetheless, the related work underlines the
benefits of analyzing the textual context of a formula. More importantly, the research has
shown that even simple noun phrase extraction provides valuable information for numerous
applications [139, 183, 214, 279, 329, 330]. This motivated us to apply these promising approaches
in our semantification pipeline, too.


Regarding the final translations towards computable formats, our comprehensive analysis of
LaTeX to MathML conversion tools in the previous section revealed that we probably gain no
benefits from translating LaTeX to MathML in an intermediate step. While many CAS provide
import functions for MathML, there is no substantial support for OpenMath CDs. Another option
would be OpenMath, since the SCSCP protocol uses OpenMath for inter-CAS communications.
However, the SCSCP is relatively complex for our task and difficult to extend for new CAS if
we do not have access to the internal libraries. Additionally, there are no translation tools from
LaTeX to OpenMath even though LaTeXML can be exploited to realize rule-based translations.


In a previous research project, we developed LaCASt, a semantic LaTeX to CAS translator, specifi-
cally for the DLMF [3, 13]. The goal of LaCASt was to translate DLMF formulae, given in semantic
LaTeX, to the CAS Maple. The semantic LaTeX macros reduced the ambiguity in mathematical
expressions and enabled LaCASt to focus on other translation issues, such as definition disparity
between the DLMF and Maple. Hence, we already established a reliable and expandable trans-
lation pipeline from semantic LaTeX to Maple. As a consequence, we focus our efforts on the
more promising semantification of LaTeX to semantic LaTeX rather than from LaTeX to content
MathML in this thesis.


This Chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/).


Since the original development of LaCASt was part of my Master’s thesis, the content of the associated early
publications [3, 13] is not reused in this thesis. For more details about LaCASt, see [13].

In this chapter, we will focus on research task II, i.e., we develop a new semantification
process that addresses the issues of existing approaches outlined in the previous chapter. We
identified two main issues with existing MathIR approaches for disambiguation and seman-
tification of LaTeX expressions. First, many semantification approaches focus solely on single
tokens, such as identifiers, or on the entire mathematical expression but fail to semantically enrich
the essential subexpressions between both extremes. Second, existing translation approaches
lack context sensitivity and disambiguate expressions by following an internal (often hidden)
context-agnostic decision process. This chapter addresses these issues in three parts.
First, we elaborate on the capabilities of word embedding techniques to semantically enrich 
mathematical expressions. Second, we study the frequency distribution of mathematical subex- 
pressions in scientific corpora to understand the variety and complexity of subexpressions 
better. Third, we briefly outline a context-sensitive translation pipeline based on the gained 
knowledge from the first two parts. 


The primary goal of this chapter is to develop a context-sensitive LaTeX to CAS translation
pipeline. Unfortunately, it is not clear where we can find sufficient semantic information in 
the context to perform reliable translations. We can expect a certain amount of inclusive 
information in the given expression itself [54, 71, 394]. Additionally, related work has proven 
that noun phrases in the nearby textual context (such as the leading or following sentences of 
a formula) can successfully disambiguate math formulae [139, 209, 213, 329]. However, many 
functions are not necessarily declared in the surrounding context because the author presumes 
the interpretation is unambiguous. Wolska and Grigore [394] have shown that only around 
70% of mathematical identifiers are explicitly declared in the surrounding context. For the remaining
identifiers, the location of the information that disambiguates the expression may vary greatly depending
on many factors, such as the expected education level of the target audience of the article, the 
given references in the document, or even the author’s preferred notation style. One possible 
solution for exploiting this source of semantic information is to build a common knowledge 
database for mathematical expressions. 


As a first attempt to automatically build such a common knowledge database that stores the 
standard, i.e., most common, meanings of mathematical symbols, we explore the capabilities 
of machine learning algorithms in the first part of this chapter. Specifically, we use word 
embeddings to learn common co-occurrences of mathematical and natural language tokens.
We will show that this approach is not as successful as we hoped for our knowledge extraction
task but enables new approaches for mathematical search engines. Further, the results will
once again underline the issues with the interpretation of nested mathematical objects. Word
embeddings for mathematical tokens largely fail to learn the connections with
defining expressions in the context because they still ignore the function layer of mathematical
expressions. Consequently, we subsequently focused our studies on mathematical subexpressions.


As a thought experiment, consider mathematical expressions to be like entire sentences in natural
languages rather than single words. Following this analogy, entire math terms are analogous to
words, and the notation of mathematical expressions certainly follows a specific grammar [54].
However, our mathematical sentences have one distinct difference compared to natural language 
sentences. The grammar of mathematical expressions is built around a nested structure in 
contrast to the sequential order of words. For example, a math term representing a variable is 
a placeholder and can be replaced with arbitrarily complex and deeply nested subexpressions 
without violating any grammatical rules. This nested structure makes the semantic tokenization 
of mathematical expressions a complex and potentially context-dependent task [71, 402]. In
order to review our analogy, we perform the most extensive notation analysis of mathematical 
subexpressions (since those are the potential words) on two real-world scientific datasets. We 
discovered that the frequency distributions of mathematical objects obey Zipf’s law, similar 
to words in natural language corpora. In turn, we can use frequency-based retrieval functions 
to distinguish important or informative mathematical objects from stop-word-like structures. 
We call these essential and informative objects Mathematical Objects of Interest (MOI). The
success of this new interpretation finally motivated us to move away from the established
MathIR techniques that focus on single identifiers or entire math expressions towards meaningful
subexpressions. Hence, we conclude this chapter with an abstract context-sensitive translation
approach that accounts for the nested grammar of mathematical formulae and is based
on the new concept of MOI.
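
The following sketch illustrates how a frequency-based retrieval function can separate informative subexpressions from stop-word-like ones. It uses a plain TF-IDF score as a stand-in, since the text above only speaks of frequency-based retrieval functions in general; the nested-tuple encoding, the toy corpus, and all function names are assumptions made for this example.

```python
import math
from collections import Counter


def subexpressions(tree):
    """Yield every subtree of a nested-tuple expression, including the tree itself."""
    yield tree
    if isinstance(tree, tuple):
        for child in tree[1:]:  # tree[0] is the operator or function name
            yield from subexpressions(child)


def document_frequencies(corpus):
    """Count in how many documents each subexpression occurs."""
    df = Counter()
    for doc in corpus:
        seen = set()
        for expr in doc:
            seen.update(subexpressions(expr))
        df.update(seen)
    return df


def tf_idf(doc, corpus):
    """Score each subexpression of `doc`; `doc` must be part of `corpus`."""
    df = document_frequencies(corpus)
    tf = Counter(s for expr in doc for s in subexpressions(expr))
    n = len(corpus)
    return {s: tf[s] * math.log(n / df[s]) for s in tf}


# Toy corpus: each document is a list of expressions.
doc1 = [("+", "x", ("zeta", "s"))]
doc2 = [("+", "x", "y")]
doc3 = [("*", "x", "t")]
corpus = [doc1, doc2, doc3]
scores = tf_idf(doc1, corpus)
```

In this toy corpus the identifier x occurs in every document and receives a score of zero, much like a stop word, while the subexpression for the zeta function stands out.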


In summary, this chapter is organized as follows. In Section 3.1, we explore the capabilities of 
word embeddings to discover common co-occurrences of natural language tokens and math 
tokens in large scientific datasets. In Section 3.2, we introduce the new concept of MOI and 
perform the first extensive frequency distribution study of mathematical notations in two large 
scientific corpora. Section 3.3 concludes the findings of the previous sections by introducing a 
novel context-sensitive translation approach from LaTeX to CAS expressions. Section 3.1 was
published as an article in the Scientometrics journal [15]. Section 3.2 was published as a full
paper at the WWW conference [14]. Excerpts of Section 3.3 have been published at the ICMS
conference in a full paper [10]. 


3.1 Semantification via Math-Word Embeddings 


Mathematics is capable of explaining complicated concepts and relations in a compact, precise, 
and accurate way. Learning this idiom takes time and is often difficult, even for humans. The
general applicability of mathematics allows a certain level of ambiguity in its expressions. Short
explanations or additional mathematical expressions, which serve as context for the reader, are
often used to mitigate the ambiguity problem. Along with context-dependency, inherent issues of linguistics
(e.g., ambiguity, non-formality) make it even more challenging for computers to understand 
mathematical expressions. Nevertheless, a system capable of automatically capturing the se- 
mantics of mathematical expressions would be suitable for improving several applications, from 
search engines to recommendation systems. Word embedding [33, 34, 43, 65, 73, 217, 222, 239, 
250, 255, 272, 293, 295] has made it possible to apply deep learning in NLP with great effect. 
That is because embedding represents individual words with numerical vectors that capture 
contextual and relational semantics of the words. Such representation enables inputting words 
and sentences to a Neural Network (NN) in numerical form. This allows the training of NNs 
and using them as predictive models for various NLP tasks and applications, such as semantic 
role modeling [149, 412], word-sense disambiguation [160, 305], sentence classification [186], 
sentiment analysis [344], coreference resolution [223, 388], named entity recognition [72], read- 
ing comprehension [75], question answering [234], natural language inference [69, 137], and 
machine translation [97]. The performance of word embedding in NLP tasks has been measured 
and shown to deliver fairly high accuracy [256, 293, 295]. 
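
To make the notion of vectors that capture contextual semantics concrete, the sketch below builds simple count-based co-occurrence vectors and compares them with cosine similarity. This is a deliberately simpler stand-in for learned embeddings such as word2vec, and the toy sentences, the window size, and the function names are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each token to a Counter of the tokens seen within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, token in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy math text in which math tokens are treated like ordinary words.
sentences = [
    ["the", "gamma", "function", "\\Gamma", "is", "analytic"],
    ["the", "zeta", "function", "\\zeta", "is", "analytic"],
    ["let", "n", "be", "an", "integer"],
]
vectors = cooccurrence_vectors(sentences)
similar = cosine(vectors["\\Gamma"], vectors["\\zeta"])
dissimilar = cosine(vectors["\\Gamma"], vectors["n"])
```

The two function symbols share most of their context words and therefore end up close to each other, whereas the symbol n shares none of them.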


As math text consists of natural text as well as math expressions that exhibit linear and contextual 
correlation characteristics that are very similar to those of natural sentences, word embedding 
applies to math text much as it does to natural text. Accordingly, it is worthwhile to explore 
the use and effectiveness of word embedding in Mathematical Language Processing (MLP), 
Mathematical Knowledge Management (MKM), and MathIR. Still, math expressions and math 
writing styles are different from natural text to the point that NLP techniques have to undergo 
significant adaptations and modifications to work well in math contexts. 


While some efforts have started to apply word embedding to MLP, such as equation embed-
ding [121, 9, 215, 400, 404], there is a healthy skepticism about the use of ML and Deep Learning
(DL) in MLP and MKM, on the basis that much work is still required to prove the effective-
ness of DL in MLP. Learning how to adapt and apply DL in the MLP/MKM/MathIR context is
not an easy task. Most applications of DL in MLP/MKM/MathIR rest on the effectiveness of
word/math-term embedding (henceforth math embedding) because the latter is the most basic
foundation in language DL. Therefore, it behooves us to start to look at the effectiveness of
math embedding in basic tasks, such as term similarity, analogy, information retrieval, and basic
math search, to learn more about its capabilities and limitations. More importantly, we need
to learn how to refine and evolve math embedding to become accurate enough for more demanding
applications, such as knowledge extraction. That is the primary objective of this section.


To that effect, there is a fundamental need for datasets and benchmarks, preferably standard 
ones, to allow researchers to measure the performance of various math embedding techniques, 
and applications based on them, in an objective and statistically significant way, and to measure 
improvements and comparative progress. Such resources are abundant in the natural language 
domain but scarce in the MLP domain. Developing some of such datasets and benchmarks will 
hopefully form the nucleus for further development by the community to facilitate research 
and speed up progress in this vital area of research. 


While the task of creating such resources for DL applications in MLP can be long and demanding, 
the examination of math embedding should not wait but should proceed right away, even if 
in an exploratory manner. Early evaluations of math embedding should ascertain its value 
for MLP/MKM/MathIR and inform the process and trajectory of creating the corpora and 
benchmarks. Admittedly, until adequate datasets and benchmarks become available for MLP, 
we have to resort to less systematic performance evaluation and rely on performing preliminary 
tests on the limited resources available. The DLMF [98] and the arXiv.org preprint archive are
good resources to start our exploratory embedding efforts. The DLMF offers high quality, and
the authors are familiar with its structure and content (which aids in crafting some of the tests).
As for the arXiv collection, its large volume of mostly math articles makes it an option worth
investigating as well.


In this section, we provide an exploratory investigation of the effectiveness and use of word 
embedding in MLP and MKM through different perspectives. First, we train word2vec models 
on the DLMF and arXiv with slightly different approaches for embedding math. Since the 
DLMF is primarily a handbook of mathematical equations, it does not provide extensive textual 
content. We will show that the DLMF-trained model is appropriate to discover mathematical
term similarities and term analogies, and to generate query expansions. We hypothesize that
the arXiv-trained models are beneficial to extract definiens, i.e., textual descriptive phrases for
math terms. We examine the possible reasons why the word embedding models trained on the
arXiv dataset do not present valuable results for this task. Besides, we discuss some of the
reasons that we believe thwart the progress in MathIR in the direction of machine learning. In
summary, we focus on five tasks: (i) term similarity, (ii) math analogies, (iii) concept modeling,
(iv) query expansion, and (v) knowledge extraction. In the context of this thesis, we are mostly
interested in the latter, i.e., knowledge extraction, and will solely focus on these experiments
and results. For the tasks (i-iv), see [15].


arXiv: https://arxiv.org/ [accessed 2019-09-01]

3.1.1 Foundations and Related Work


Understanding mathematical expressions essentially means comprehending the semantic value
of their internal components, which can be accomplished by linking their elements with their
corresponding mathematical definitions. Current MathIR approaches [213, 329, 330] try to
extract textual descriptors of the parts that compose mathematical equations. Intuitively, there 
are questions that arise from this scenario, such as (i) how to determine the parts which have 
their own descriptors, and (ii) how to identify correct descriptors over others. 


Answers to (i) are concerned with deciding which parts of a
mathematical expression are considered as one mathematical object [197, 18, 402]. Current
definition languages, such as the content MathML 3.0 specification, are often imprecise. For
example, content MathML 3.0 uses ‘csymbol’ elements for functions and specifies them as
expressions that refer to a specific, mathematically-defined concept with an external definition.
However, in case of the Van der Waerden number, for instance, it is not clear whether W or
the sequence W(r, k) should be declared as a ‘csymbol’. Another example involves content
identifiers, which MathML specifies as mathematical variables that have properties, but no fixed
value. While content identifiers are allowed to have complex rendered structures, it
is not permitted to enclose identifiers within other identifiers. Let us consider $a_i$, where $a$ is a
vector and $a_i$ its i-th element. In this case, $a_i$ should be considered as a composition of three
content identifiers, each one carrying its own individualized semantic information, namely the
vector $a$, the element $a_i$ of the vector, and the index $i$. However, with the current specification,
the definition of these identifiers would not be canonical. One possible workaround to represent
such expressions with content MathML is to use a structure of four nodes, interpreting $a_i$ as
a function via a ‘csymbol’ (one parent ‘apply’ node with the three children vector-selector, $a$,
and $i$). However, ML algorithms and MathIR approaches would benefit from more precise
definitions and a unified answer for (i). Most of the related work relies on these relatively vague
definitions and on the analysis of content identifiers, focusing its efforts on (ii).


Questions (i), (ii), and other pragmatic issues are already in discussion in a bigger context, as 
data production continues to rise and digital repositories seem to be the future for any archive 
structure. A prominent example is the National Research Council’s effort to establish what they
call the Digital Mathematical Library (DML)⁶, a project under the International Mathematical
Union. The goal of this project is to take advantage of new technologies and to help overcome
the inability to search, relate, and aggregate information about mathematical expressions in
documents on the web.


The advances most relevant to our work are the recent developments in word embedding [43, 
65, 73, 256, 293, 295, 313]. Word embedding takes as input a text collection and generates a 
numerical feature vector (typically with 100 or 300 dimensions) for each word in the collection. 
This vector captures latent semantics of a word from the contexts of its occurrences in the 


² https://www.w3.org/TR/MathML3/ [accessed 2019-09-01]

³ Note that OpenMath is another specification designed to encode semantics of mathematics. However, content
MathML is an encoding of OpenMath and inherent problems of content MathML also apply to OpenMath (see
https://www.openmath.org/om-mml/ [accessed 2019-09-01]).

⁴ https://www.w3.org/TR/MathML3/chapter4.html#contm.csymbol [accessed 2019-09-01]

⁵ https://www.w3.org/TR/MathML3/chapter4.html#contm.ci [accessed 2019-09-01]

⁶ https://www.nap.edu/read/18619 [accessed 2019-09-01]
collection; in particular, words that often co-occur nearby tend to have similar feature vectors
(where similarity is measured by the cosine similarity, the Euclidean distance, etc.). 
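
The two similarity measures mentioned above can behave differently, which the following minimal sketch tries to illustrate. The three-dimensional vectors and the token names are hypothetical and only serve as an illustration; real embedding vectors typically have 100 or 300 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between the end points of u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy vectors (not taken from any trained model).
v_function = [0.9, 0.1, 0.3]
v_mapping = [1.8, 0.2, 0.6]  # same direction as v_function, twice the length
v_matrix = [0.1, 0.9, 0.2]   # points in a different direction

same_direction = cosine_similarity(v_function, v_mapping)
different_direction = cosine_similarity(v_function, v_matrix)
distance = euclidean_distance(v_function, v_mapping)
```

In this toy setting, the cosine similarity ignores the length of the vectors, while the Euclidean distance does not: the first two vectors are maximally similar in terms of the cosine, yet their Euclidean distance is not zero.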


Recently, more and more projects try to adapt these word embedding techniques to learn patterns
of the correlations between context and mathematics. Gao et al. [121] embed
single symbols and train a model that can discover similarities between mathematical symbols.
Similarly, Krstovski and Blei [215] use a variation of word embedding to
represent complex mathematical expressions as single unit tokens for IR. In 2019, Yasunaga and
Lafferty [400] explored an embedding technique based on recurrent neural networks to improve
topic models by considering mathematical expressions. They state that their approach outperforms
topic models that do not consider mathematics in text and report a topic coherence improvement
of 0.012 over the LDA⁷ baseline. Equation embedding, as in [121, 215, 400], presents promising
results for identifying similar equations and contextual descriptive keywords. In the following,
we will explore different word embedding techniques in more detail.


3.1.1.1 Word Embedding 


In this section, we apply word2vec [256] on the DLMF [98] and on the collection of arXiv docu- 
ments for generating embedding vectors for various math symbols and terms. The word2vec 
technique computes real-valued vectors for words in a document using two main approaches: 
skip-gram and continuous bag-of-words (CBOW). Both produce a fixed-length n-dimensional 
vector representation for each word in a corpus. In the skip-gram training model, one tries to
predict the context of a given word, while CBOW predicts a target word given its context.
In word2vec, context is defined as the adjacent neighboring words in a defined range, called 
a sliding window. The main idea is that the numerical vectors representing similar words 
should have close values if the words have similar context, often illustrated by the king-queen 
relationship. 
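
The difference between the two training models can be sketched by listing the training examples each of them derives from a sliding window. The following is a simplified illustration, not the word2vec implementation itself: the function names and the example sentence are hypothetical, and the sketch uses a fixed symmetric window without any sampling.

```python
def context_windows(tokens, window):
    """Yield (target, context_words) for each position, using a symmetric window."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


def skipgram_pairs(tokens, window):
    """Skip-gram: the target word is the input, each neighbor is one output."""
    return [(target, c) for target, ctx in context_windows(tokens, window) for c in ctx]


def cbow_examples(tokens, window):
    """CBOW: all neighbors together are the input, the target word is the output."""
    return [(ctx, target) for target, ctx in context_windows(tokens, window) if ctx]


# Hypothetical token stream; 'math-f' stands for a mathematical identifier token.
tokens = ["let", "math-f", "be", "continuous", "function"]

pairs = skipgram_pairs(tokens, window=1)
examples = cbow_examples(tokens, window=1)
```

With a window of one token, skip-gram produces one pair per neighbor (for example, `("math-f", "let")` and `("math-f", "be")`), whereas CBOW produces one example per position (for example, `(["let", "be"], "math-f")`).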


King-Queen Relationship of Word-Embedding Vectors


The king-queen relationship describes the similarity (in terms of the cosine distance
between the vectors) of:

\[ v_{\text{king}} - v_{\text{man}} \approx v_{\text{queen}} - v_{\text{woman}}, \tag{3.1} \]

where $v_t$ represents the vector for the token $t$.
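
Such a relationship is commonly queried by vector arithmetic: the difference of two vectors is added to a third, and the tokens closest to the result are returned. The following sketch assumes hypothetical two-dimensional toy vectors and a hypothetical helper `analogy`; it does not normalize the input vectors and is therefore only an approximation of what embedding libraries do.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(vectors, positive, negative, topn=3):
    """Rank tokens by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for token in positive:
        target = [t + x for t, x in zip(target, vectors[token])]
    for token in negative:
        target = [t - x for t, x in zip(target, vectors[token])]
    exclude = set(positive) | set(negative)  # do not return the query tokens
    scored = [(tok, cosine(target, vec)) for tok, vec in vectors.items() if tok not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors, constructed so that the relation holds exactly.
vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -1.0],
}

# v_king - v_man + v_woman should be close to v_queen.
result = analogy(vectors, positive=["king", "woman"], negative=["man"])
```

In a trained model the relation only holds approximately, so the expected token is usually one of the top-ranked hits rather than an exact match.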


Extending word2vec’s approaches, Le and Mikolov [222] propose Paragraph Vectors, a framework
that learns continuous distributed vector representations for text segments of any size
(e.g., sentences, paragraphs, documents). This technique alleviates the inability of word2vec to
embed documents as one single entity. It also comes in two distinct variations:
Distributed Memory and Distributed Bag-of-Words, which are analogous to the CBOW and
skip-gram training models, respectively.


Other approaches also produce word embeddings given a training corpus as input, such as
fastText [43], ELMo [295], and GloVe [293]. The choice for word2vec for our experiments is
justified because of its implementation ease, training speed using modest computing resources,


⁷ Latent Dirichlet Allocation
general applicability, and robustness in several NLP tasks [160, 161, 229, 238, 302, 312].
fastText learns word representations as a sum of the n-grams of
its constituent characters (sub-words). This sub-word structure would introduce a certain amount of
noise⁸ into our experiments. ELMo computes its word vectors as the average of their
character representations, which are obtained through a two-layer bidirectional language
model (biLM). This would bring even more granularity than fastText, as each
character in a word has its own n-dimensional vector representation. Another factor
that prevents us from using ELMo, for now, is its expensive training process⁹. Closer to the
word2vec technique, GloVe [293] was also considered, but its co-occurrence matrix would escalate
the memory usage, making training on arXiv infeasible at the moment. We also examined
the recently published Universal Sentence Encoder [65] from Google, but its implementation
does not allow one to use a new training corpus, only to access its pre-calculated vectors
based on words. We also considered BERT [96], with its recent advances of Transformer-based
architectures in NLP, as an alternative to word2vec. However, incorporating BERT and other
Transformer-based architectures would require a significant restructuring of the core idea of our
work. BERT is pre-trained on two general tasks that are not directly transferable to mathematics
embeddings: Masked Language Modelling and Next Sentence Prediction. Since this work is an
exploratory investigation of the potential of word embedding techniques in MLP and MKM, we
gave preference to tools that could be applied directly. Nonetheless, since some of our results
are promising, we plan to include Transformer-based systems, such as BERT [96], XLNet [399],
RoBERTa [235], and Transformer-XL [87], in future work.


The overall performance of word embedding algorithms has shown superior results in many 
different NLP tasks, such as machine translation [256], relation similarity [161], word sense 
disambiguation [55], word similarity [268, 312], and topic categorization [301]. In the same 
direction, we also explore how well mathematical tokens can be embedded according to their 
semantic information. However, mathematical formulae are highly ambiguous and, if not 
properly processed, their representation is jeopardized. 


To investigate the situations described in Sections 3.1.1.1 and 2.2.5, we applied word2vec in
two different scenarios, one focusing on MathIR (DLMF) and the other on semantic knowledge
extraction (arXiv), i.e., identifying definiens for math objects. To summarize our decisions, for
the DLMF and arXiv, we chose the stream-of-tokens embedding technique, i.e., each inner token
is represented as a single n-dimensional vector in the embedding model. For the DLMF, we
embed all inner tokens, while for the arXiv, we only embed the identifiers. In this thesis, we
are more interested in applying math embeddings to the semantic extraction task. The MathIR task
is described in [15, Section 3].


3.1.2 Semantic Knowledge Extraction 


Extracting definiens of mathematical objects from a textual context is a common task in 
MathIR [214, 279, 329, 330, 405] that often provides a gold dataset for its evaluation. Since 
the DLMF does not provide extensive textual information for its mathematical expressions, we 
considered an alternative scenario in our analysis, one in which we trained a second word2vec 
model on a much larger corpus composed of articles/papers from the arXiv collection. In this 
section, we compare our findings against the approach by Schubotz et al. [330]. We apply variations


⁸ Noise means that the data contains many uninteresting tokens that affect the trained model negatively.
⁹ https://github.com/allenai/bilm-tf [accessed 2019-09-01]
of a word2vec [256] and paragraph vectors [222] implementation to extract mathematical
relations from the arXMLiv 2018 [132] dataset (i.e., an HTML collection of the arXiv.org preprint
archive¹⁰), which is used as our training corpus. We also consider the subsets that do not report
errors during the document conversion (i.e., no_problem and warning), which represent 70% of
arXiv.org. We make the code for our experiments publicly available¹¹.


3.1.2.1 Evaluation of Math-Embedding-Based Knowledge Extraction 


As a pre-processing step, we represent mathematical expressions using the MathML¹² notation.
First, we replace all mathematical expressions with the sequence of identifiers they contain, e.g.,
W(2,k) is replaced by ‘W k’. We also add the prefix ‘math-’ to all identifier tokens to distinguish
between textual and mathematical terms later. Second, we remove all common English
stopwords from the training corpus. Finally, we train a word2vec model (skip-gram) using the
following hyperparameters¹³: vector size of 300 dimensions, a window size of 15, minimum
word count of 10, and a negative sampling of 1E-5. We justify the hyperparameters used in
our experiments based on previous publications using similar models [63, 221, 222, 255, 312].
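
The pre-processing steps can be sketched as follows. This is a minimal illustration under several assumptions that are not part of the original pipeline: the identifiers of each formula are assumed to be already extracted from the MathML, the document is given as a hypothetical list of `(kind, content)` segments, the stopword list is a tiny stand-in, and lowercasing of textual words is an added simplification.

```python
# Tiny illustrative stopword list (an assumption, not the list used in the experiments).
STOPWORDS = {"the", "is", "a", "an", "of", "and", "be", "where", "are"}


def preprocess(segments):
    """Turn text and math segments into one token stream for embedding training."""
    tokens = []
    for kind, content in segments:
        if kind == "math":
            # A formula is replaced by its identifiers, each prefixed with 'math-'.
            tokens.extend("math-" + identifier for identifier in content)
        else:
            for word in content.lower().split():
                word = word.strip(".,;:()")
                if word and word not in STOPWORDS:
                    tokens.append(word)
    return tokens


# Hypothetical input: W(2,k) contributes the identifiers W and k.
segments = [
    ("text", "The Van der Waerden number"),
    ("math", ["W", "k"]),
    ("text", "is a function of"),
    ("math", ["k"]),
]

token_stream = preprocess(segments)
```

The prefix keeps the identifier `math-a` apart from the English article ‘a’, which would otherwise be removed as a stopword.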


In the following, distances between vectors are calculated via the cosine distance. The trained 
model was able to partially incorporate semantics of mathematical identifiers. For instance, 
the closest 27 vectors to the mathematical identifier f are mathematical identifiers themselves 
and the fourth closest noun vector to f is ‘function’. We observe that the results of the model 
trained on arXiv are comparable with our previous experiments on the DLMF. 


Previously, we used the semantic relations between embedding vectors to search for relevant 
terms in the model. Hereafter, we will refer to this algebraic property as semantic distance to a 
given term with respect to a given relation, i.e., two related vectors. For example, to answer 
the query/question: What is to ‘complex’ as x is to ‘real’, one has to find the closest semantic
vectors to ‘complex’ with respect to the relation between x and ‘real’, i.e., finding $v$ in

\[ v - v_{\text{complex}} \approx v_x - v_{\text{real}}. \]

Instead of asking for mathematical expressions, we will now reword the query to ask for specific 
words. For example, to retrieve the meaning of f from the model, we can ask for: What is to 
f as ‘variable’ is to x? Or in other words, what is semantically close to f with respect to the 
relation between ‘variable’ and x? Table 3.1 shows the top 10 semantically closest results to f
with respect to the relations between $v_{\text{variable}}$ and $v_x$, $v_{\text{variable}}$ and $v_y$, and
$v_{\text{variable}}$ and $v_a$.


From Table 3.1, we can observe a similar behaviour for all three relations. Later, we will see that mathematical
vectors build a cluster in the trained model, i.e., that the vectors $v_x$, $v_y$, and $v_a$ are close to
each other with respect to the cosine similarity. This cluster, and the fact that we did not use
stemming and lemmatization for preprocessing, explains why the top hit to the queries is always
‘variables’. To refine the order of the extracted answers, we calculated the cosine similarity
between $v_f$ and the vectors for the extracted words directly. Table 3.2 shows the cosine distances
between $v_f$ and the extracted words from the query: Term is to f what ‘variable’ is to a.


¹⁰ https://arxiv.org/ [accessed 2019-09-01]

¹¹ https://github.com/ag-gipp/math2vec [accessed 2019-09-01]

¹² The source TeX file has to use mathematical environments for its expressions.

¹³ Hyperparameters that are not mentioned are used with their default values as described in the Gensim API [307].
Table 3.1: Analogies of the form: Find the Term where Term is a word that is to X what Y is to Z.
Top-10 best Terms and their cosine similarities where

Term is to f what      | Term is to f what      | Term is to f what
‘variable’ is to x     | ‘variable’ is to y     | ‘variable’ is to a
-----------------------+------------------------+-----------------------
variables     0.7655   | variables     0.7481   | variables     0.7600
independent   0.7411   | function      0.7249   | function      0.7154
appropriate   0.7279   | given         0.7103   | appropriate   0.6925
means         0.7250   | means         0.7083   | independent   0.6789
ie            0.7234   | ie            0.7067   | instead       0.6784
instead       0.7233   | independent   0.7030   | defined       0.6729
namely        0.7139   | thus          0.6925   | namely        0.6719
function      0.7131   | instead       0.6922   | continuous    0.6707
following     0.7117   | appropriate   0.6891   | depends       0.6629
depends       0.7095   | defined       0.6889   | represents    0.6623


Asking for the meaning of f is a very generic question. Thus, we performed a detailed evaluation
on the first 100 entries¹⁴ of the MathMLben benchmark [18]. We evaluated the average of the
semantic distances with respect to the relations between $v_{\text{variable}}$ and $v_x$, $v_{\text{variable}}$ and $v_a$, and
$v_{\text{function}}$ and $v_f$. We chose these relations because we believe that they
are the most general and still applicable, e.g., as seen in Table 3.2. In addition, we consider only
results with a cosine similarity equal to or greater than 0.70 to maintain a minimum quality
in our experiments. The overall results were poor, with a precision of p = .0023 and a recall
of r = .052. Despite the weak results, an investigation of some specific examples showed
interesting characteristics; for example, for the identifier W, the four semantically closest
results were functions, variables, form, and the mathematical identifier g. The poor performance
illustrates that there might be underlying issues with our approach. However, as mentioned
before, mathematical notation is highly flexible and content-dependent. Hence, in the next
section, we explore a technique that rearranges the hits according to the actual close context of
the mathematical expression.
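
How a similarity threshold interacts with precision and recall can be sketched as follows. The evaluation protocol is not spelled out in detail above, so this sketch makes its own assumptions: hits and gold terms are pooled over all identifiers (micro-averaging), a hit counts as correct only on an exact string match, and all names and numbers in the example are hypothetical.

```python
def evaluate(extracted, gold, threshold=0.70):
    """Pooled precision and recall of extracted terms above a similarity threshold.

    extracted: {identifier: [(term, cosine_similarity), ...]}
    gold:      {identifier: set of correct terms}
    """
    true_pos = false_pos = false_neg = 0
    for identifier, gold_terms in gold.items():
        hits = extracted.get(identifier, [])
        predicted = {term for term, sim in hits if sim >= threshold}
        true_pos += len(predicted & gold_terms)
        false_pos += len(predicted - gold_terms)
        false_neg += len(gold_terms - predicted)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical hits and gold annotations for two identifiers.
extracted = {
    "W": [("functions", 0.81), ("variables", 0.78), ("form", 0.74), ("number", 0.65)],
    "f": [("function", 0.79), ("given", 0.72)],
}
gold = {"W": {"number"}, "f": {"function"}}

precision, recall = evaluate(extracted, gold)
```

In this toy example, the correct term for W falls below the threshold and three generic terms pass it. This is the pattern that drives precision down when corpus-wide neighbors such as ‘variables’ dominate the hit list.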


3.1.2.2 Improvement by Considering the Context 


We also investigate how a different word embedding technique would affect our experiments. To
do so, we trained a Distributed Bag-of-Words of Paragraph Vectors (DBOW-PV) [222] model. We
trained this DBOW-PV model on the same corpus as our word2vec model (with the same preprocessing
steps) with the following configuration: 400 dimensions, a window size of 25, and a minimum
count of 10 words. Schubotz et al. [330] analyze all occurrences of mathematical identifiers
and consider the entire article at once. We believe this prevents the algorithm from finding
the right descriptor in the text, since later or prior occurrences of an identifier might appear in
a different context, and potentially introduce different meanings. Instead of using the entire
document, we apply the algorithm by Schubotz et al. [330] only to the input paragraph and


¹⁴ Same entries as used in [330]
Table 3.2: The cosine distances of f with regard to the hits in Table 3.1.

Cosine distances between the
Terms from Table 3.1 to f
-----------------------------
function      0.8138
defined       0.7932
independent   0.7323
namely        0.7214
depends       0.7022
represents    0.6983
instead       0.6837
appropriate   0.6698
continuous    0.6203
variables     0.5638


similar paragraphs given by our DBOW-PV model. Unfortunately, the obtained variance within 
the paragraphs brings a high number of false positives to the list of candidates, which affects 
the performance of the original approach negatively. 


As a second approach for improving our system, we considered a given textual context to 
reorder extracted words according to their cosine similarities to the given context. For example, 
consider the sentence: ‘Let f(x, y) be a continuous function where x and y are arbitrary values.’
We ask for the meaning of f concerning this given context sentence. The top-k closest words
to f in the word2vec model only represent the distance over the entire corpus, in this case,
arXiv, but not regarding a given context. To address this issue, we retrieved similar paragraphs
to our context example via the DBOW-PV model and computed the weighted average distance
between all top-k words that are similar to f and the retrieved sentences. We expected that the
word describing f in our example sentence would also present a higher cosine similarity to the 
context itself. Table 3.3 shows the top-10 closest words (i.e., we filtered out other math tokens) 
and their cosine similarity to f in the left column. The right column shows the average cosine 
similarities of the extracted words to the context example sentence we used and its retrieved 
similar sentences. 
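
The reordering step can be sketched as follows. The sketch assumes that word vectors and context (paragraph) vectors live in the same vector space, which depends on how the paragraph-vector model is trained; the helper `rerank_by_context`, the weights, and all two-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rerank_by_context(candidates, word_vectors, context_vectors, weights=None):
    """Order candidates by their (weighted) average cosine similarity to the contexts."""
    if weights is None:
        weights = [1.0] * len(context_vectors)  # plain average
    total = sum(weights)
    scored = []
    for word in candidates:
        sims = [cosine(word_vectors[word], c) for c in context_vectors]
        score = sum(w * s for w, s in zip(weights, sims)) / total
        scored.append((word, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors for two candidate descriptors of f.
word_vectors = {"given": [1.0, 0.0], "function": [0.6, 0.8]}
# First entry: the given context sentence; second entry: one retrieved similar sentence.
context_vectors = [[0.0, 1.0], [0.6, 0.8]]

plain = rerank_by_context(["given", "function"], word_vectors, context_vectors)
# Weighted variant that trusts the given sentence twice as much as the retrieved one.
weighted = rerank_by_context(["given", "function"], word_vectors, context_vectors, weights=[2.0, 1.0])
```

The toy vectors are chosen so that the descriptor moves to the top; they do not reproduce the behaviour of the trained model.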


As Table 3.3 illustrates, this context-sensitive approach was not beneficial; in fact, it undermined
our model. Since the identifier should be closer to the given context sentence
than to the related sentences retrieved from the DBOW-PV model, we also explored the
use of a weighted average. However, the weighted average did not improve on the results of the
normal average. Other hyperparameters for the word embedding models were also tested in an
attempt to tune our system. However, we could not observe any drastic changes in
the measured performance.


3.1.2.3 Visualizing Our Model 


Figure 3.1 illustrates four t-SNE [154] plots of our word2vec model. Since t-SNE plots may
produce misleading structures [382], we plot four t-SNE plots with different perplexity values.
Table 3.3: We are looking for descriptive terms for f in a given context ‘Let f(x,y) be a
continuous function where x and y are arbitrary values’. To achieve this, we retrieved close
vectors to f and computed their distances to the given context sentence. To bring variety to
the context, we used our DBOW-PV model to retrieve related sentences to the given context
and computed the average distance of the words to these related sentences.

Top-10 closest words (no      | After reordering the hits
math symbols) to f and their  | according to their distances
cosine similarities.          | to the context vector.
------------------------------+------------------------------
given           0.8162        | case            0.8568
case            0.7960        | corresponding   0.8562
corresponding   0.7957        | note            0.8451
function        0.7900        | thus            0.8414
note            0.7803        | obtain          0.8413
thus            0.7726        | ie              0.8335
obtain          0.7712        | since           0.8250
value           0.7682        | function        0.8086
ie              0.7656        | value           0.8015
since           0.7583        | given           0.7096


Other parameters were set to their default values according to the t-SNE python package.
We colored word tokens in blue and math tokens in red. The plots illustrate, though not
surprisingly, that math tokens are clustered together. However, a certain subset of math tokens
appears isolated from other math tokens. By attaching the content to some of the vectors,
we can see that math tokens such as and (an ‘and’ within math mode) and im (most likely
referring to imaginary numbers) are part of a second cluster of math tokens. The plot is similar
to the visualized model presented in [121], even though they use a different word embedding
technique. Hence, the general structure within math word2vec models seems to be insensitive
to the embedding technique of formulae used. Compared to [121], we provide a model with
richer details that reveal some dense clusters, e.g., numbers (bottom right plot at (11, 8)) or
equation labels (bottom right plot at (−14, 0)).


Based on the presented results, one can still argue that more settings should be explored (e.g., dif- 
ferent parameters and embedding techniques) for the embedding phase, different pre-processing 
steps (e.g., stemming and lemmatization) should be adopted, and post-processing techniques 
(e.g., boosting terms of interest based on a knowledge database such as OntoMathPro [104, 105]) 
should also be investigated. This presumably solves some minor problems, such as removing 
the inaccurate first hit in Table 3.1. Nevertheless, the overall results would not surpass the 
ones in [330], which reports a precision score of p = 0.48. On the grounds that mathematics is 
highly customizable, many of the defined relations between mathematical concepts and their 
descriptors are only valid in a local scope. Let us consider an author who denotes an algorithm
using the symbol π. The author’s specific use of π does not change its general use, but it affects
the meaning in the scope of the article. Current ML approaches only learn patterns of the most
frequently used combinations, e.g., between f and ‘function’, as seen in Table 3.1.
Figure 3.1: t-SNE plot of top-1000 closest vectors of the identifier f with perplexity values 5
(top left), 10 (top right), 40 (bottom left), and 100 (bottom right) and the default values of the 
t-SNE python package for other settings. 


Even though math notations can change, such as π in the example above, one could assume the
existence of a common ground for most notations. The low performance of our experiments
compared to the results in [330] seems to confirm that math notations change regularly in
real-world documents, i.e., are tied to a specific context. If a common ground for math
notations exists, it must be marginally small, at least in the 100 test cases from [18].


3.1.3 On Overcoming the Issues of Knowledge Extraction Approaches 


We assume that the low performance of our knowledge extraction experiments is caused by
fundamental issues that should be discussed before more effort is put into training ML algorithms
for extracting knowledge from math expressions. In the following, we discuss some of these issues
and directions that we believe can help ML algorithms to understand mathematics better.


It is reported that 70% of mathematical symbols are explicitly declared in the context [394]. 
Only four reasons justify an explicit declaration in the context: (a) a new mathematical symbol 
is defined, (b) a known notation is changed, (c) used symbols are present in other contexts and 
require specifications to be correctly interpreted, or (d) authors’ declarations are redundant 
(e.g., for improving readability). We assume (d) is a rare scenario compared to the other ones 
(a-c), except in educational literature. Current math-embedding techniques can learn semantic 
connections only in that 70%, where the definiens is available. Besides (d), the algorithm
would learn either rare notations (in case of (a)) or ambiguous notations (in cases (b-c)). The
flexibility with which mathematical documents can (re)define mathematical notations further
adds to the complexity of learning mathematics.


Learning algorithms would benefit from literature focused on (a) and (d), instead of (b) and (c). 
Similar to students who start to learn mathematics, ML algorithms have to consider the structure 
of the content they learn. It is hard to learn mathematics only considering arXiv documents 
without prior or complementary knowledge. Usually, these documents represent state-of-the-art
findings containing new and unusual notations and lack extensive explanations (e.g.,
due to page limitations). In contrast, educational books carefully and extensively explain new
concepts. We assume better results can be obtained if ML algorithms are trained in multiple
stages, first on educational literature, then on datasets of advanced math articles. A basic
model trained on educational literature should capture standard relations between mathematical
concepts and descriptors. This model should also be able to capture patterns independently of
how new or unusual the notations in the literature are. In 2014, Matsuzaki et al. [247]
presented some promising results on automatically answering mathematical questions from
Japanese university entrance exams. While the approach involves many manual adjustments and
analyses, the promising results illustrate the different levels of knowledge that are still required
for understanding arXiv documents vs. university entrance level exams. A well-structured
digital mathematical library that distinguishes the different levels of sophistication in articles 
(e.g. introductions vs. state-of-the-art publications) would also benefit mathematical machine 
learning tasks. 


The lack of references and applications that provide a solid semantic structure of natural language
for mathematical identifiers makes the disambiguation of the latter even more
challenging. In natural texts, one can try to infer the most suitable word sense for a word based
on the lemma¹⁵ itself, the adjacent words, dictionaries, and thesauri, to name a few. However, in
the mathematical arena, the scarcity of resources and the flexibility of redefining identifiers
make this issue much harder. The context text preceding or following the mathematical equation
is essential for its understanding. This context can be located at a long or short distance
from the equation, which aggravates the problem. Thus, a comprehensive annotated
dataset that addresses these needs of structural knowledge would enable further progress in
MathIR with the help of ML algorithms.


Another primary source of complexity is the inherent ambiguity present in any language, 
especially in mathematics. A typical workaround in linguistics for such ambiguous notations is 
to consider the use of lexical databases (e.g., WordNet [116, 261]) to identify the most suitable 
word senses for a given word. These databases allow embedding algorithms to train a vector
for each semantic meaning of every token. For example, Java could have multiple vectors in
a single model according to the different meanings of the word, e.g., the island in the south of
Indonesia, the programming language, or the coffee beans. However, mathematics lacks such
systems, which makes their adoption not feasible at the moment. Youssef [402] proposes the
use of tags, similar to the PoS tags in linguistics, but for tagging mathematical TeX tokens,
bringing more information to the tokens considered. As a result, a lexicon containing several
meanings for a large set of mathematical symbols is developed. OntoMathPro [104, 105] aims at
generating a comprehensive ontology of mathematical knowledge and, therefore, also contains


¹⁵ Canonical form, dictionary form, or citation form of a set of words




information about the different meanings of mathematical tokens. Such dictionaries might 
enable the disambiguation approaches in linguistics to be used in mathematical embedding in 
the near future. 


Another issue in recent publications is the lack of standards and the scarcity of benchmarks
to properly evaluate MathIR algorithms. Krstovski and Blei [215], and Yasunaga and Lafferty [400]
provide an interesting perspective on the problem of mathematical embeddings. Their
experiments are focused on math-analogies. Our findings in Section 3.2 corroborate the
math-analogies results, as our experiments have comparable results in a controlled environment.
However, because of a missing well-established benchmark, we, as well as the mentioned
publications, are only able to provide incipient results. Existing datasets are often created
for and, therefore, limited to specific tasks. For example, the NTCIR math tasks [21, 22, 405]
or the upcoming ARQMath¹⁶ task provide datasets that are specifically designed to tackle
problems of mathematical search engines. The secondary task of ARQMath actually searches for
math-analogies. In general, a proper, common standard for interpreting semantic structures of
mathematics (see for example the mentioned problems with $a_i$ in Section 2) would be beneficial
for several tasks in MathIR, such as semantic knowledge extraction.


3.1.4 The Future of Math Embeddings 


As we explored through this section, our preliminary results stress the urgent need for creating 
extensive math-specific benchmarks for testing math embedding techniques on math-specific 
tasks. To appreciate more the magnitude and dimensions of creating such benchmarks, it is 
instructive to look at some of those developed for NLP whose tasks can beneficially inform 
and guide corresponding tasks in MLP. The NLP benchmarks include one for natural language 
inference [47], one for machine comprehension [306], one for semantic role modeling [281], 
and one for language modeling [68], to name a few. With such benchmarks, which are often de 
facto standards for the corresponding NLP tasks, the NLP research community has been able 
to (1) measure the performance of new techniques up to statistical significance, and (2) track 
progress in various NLP techniques, including deep learning for NLP, by quickly comparing 
the performance of new techniques to others and to the state-of-the-art. 


While our exploratory studies regarding our term similarities, analogies, and query expansions 
need extensive future experimentation for statistically significant validation on large datasets 
and benchmarks, they show some of the promise and limitations of word embedding in math 
(MLP) applications. Especially its applicability for our desired knowledge extraction process is 
highly questionable. One of the main issues we encountered for embedding mathematics is the 
inability to model the nested semantic structure of mathematical expressions. In the following, 
we will further explore properties of mathematical subexpressions by analyzing their frequency 
distributions in large datasets. 


3.2 Semantification with Mathematical Objects of Interest 


As discussed before, math expressions often contain meaningful and important subexpressions. 
MathIR [141] applications could benefit from an approach that lies between the extremes of 


¹⁶ https://www.cs.rit.edu/~dprl/ARQMath/ [accessed 2020-02-01]
examining only individual symbols or considering an entire equation as one entity. Consider,
for example, the explicit definition for Jacobi polynomials [98, (18.5.7)]


The Explicit Definition of Jacobi Polynomials

\[
P_n^{(\alpha,\beta)}(x) = \frac{\Gamma(\alpha+n+1)}{n!\,\Gamma(\alpha+\beta+n+1)}
\sum_{m=0}^{n} \binom{n}{m} \frac{\Gamma(\alpha+\beta+n+m+1)}{\Gamma(\alpha+m+1)}
\left(\frac{x-1}{2}\right)^{m} \tag{3.2}
\]

The interesting components in this equation are $P_n^{(\alpha,\beta)}(x)$ on the left-hand side, and the
appearance of the gamma function $\Gamma(s)$ on the right-hand side, implying a direct relationship
between Jacobi polynomials and the gamma function. Considering the entire expression as a
single object misses this important relationship. On the other hand, focusing on single symbols
can result in the misleading interpretation of $\Gamma$ as a variable and $\Gamma(\alpha+n+1)$ as a multiplication
between $\Gamma$ and $(\alpha+n+1)$. A system capable of identifying the important components, such
as $P_n^{(\alpha,\beta)}(x)$ or $\Gamma(\alpha+n+1)$, is therefore desirable. Hereafter, we define these components as
Mathematical Objects of Interest (MOI) [9].


The importance of math objects is a somewhat imprecise notion and thus difficult to measure.
So far, little effort has been made to identify meaningful subexpressions.
Kristianto et al. [214] introduced dependency graphs between formulae. With this approach,
they were able to build dependency graphs of mathematical expressions, but only if the
expressions appeared as single expressions in the context. For example, if $\Gamma(\alpha+n+1)$ appears as
a stand-alone expression in the context, the algorithm will declare a dependency with
Equation (3.2). However, it is more likely that different forms, such as $\Gamma(s)$, appear in the context.
Since this expression does not match any subexpression in Equation (3.2), the approach cannot
establish a connection with $\Gamma(s)$. Kohlhase et al. studied in [191, 193, 196] another approach
to identify essential components in formulae. They performed eye-tracking studies to identify
important areas in rendered mathematical formulae. While this is an interesting approach that
offers insights into how humans read and understand math, it does not scale to
extensive studies.


This section presents the first extensive frequency distribution study of mathematical equations
in two large scientific corpora, the e-Print archive arXiv.org (hereafter referred to as arXiv)
and the international reviewing service for pure and applied mathematics zbMATH. We will
show that math expressions, similar to words in natural language corpora, also obey Zipf’s
law [297], and therefore follow a Zipfian distribution. Related research projects observed a
relation to Zipf’s law for single math symbols [71, 329]. In the context of quantitative linguistics,
Zipf’s law states that given a text corpus, the frequency of any word is inversely proportional
to its rank in the frequency table. Motivated by the similarity to linguistic properties, we will
present a novel approach for ranking formulae by their relevance via a customized version of
the ranking function BM25 [310]. We will present results that can be easily embedded in other
systems in order to distinguish between common and uncommon notations within formulae.
Our results lay a foundation for future research projects in MathIR.


https://arxiv.org/ [accessed 2019-09-01]
https://zbmath.org [accessed 2019-09-01]
Fundamental knowledge on frequency distributions of math formulae is beneficial for numerous
applications in MathIR, ranging from educational purposes [341] to math recommendation
systems [50], search engines [92, 274], and even automatic plagiarism detection systems [253,
254, 334]. For example, students can search for the conventions to write certain quantities in
formulae; document preparation systems can integrate an auto-completion or auto-correction
service for math inputs; search or recommendation engines can adjust their ranking scores
with respect to standard notations; and plagiarism detection systems can estimate whether two
identical formulae indicate potential plagiarism or are just using the conventional notations in
a particular subject area. To exemplify the applicability of our findings, we present a textual
search approach to retrieve mathematical formulae. Further, we will extend zbMATH’s faceted
search by providing facets of mathematical formulae according to a given textual search query.
Lastly, we present a simple auto-completion system for math inputs as a contribution towards
advancing mathematical recommendation systems. In addition, we show that the results provide
useful insights for plagiarism detection algorithms. We provide access to the source code,
the results, and extended versions of all of the figures appearing in this paper at
https://github.com/ag-gipp/FormulaCloudData.


3.2.1 Related Work 


Today, mathematical search engines index formulae in a database. Much effort has been un- 
dertaken to make this process as efficient as possible in terms of precision and runtime per- 
formance [92, 181, 231, 236, 407]. The generated databases naturally contain the information 
required to examine the distributions of the indexed mathematical formulae. Yet, no in-depth 
studies of these distributions have been undertaken. Instead, math search engines focus on 
other aspects, such as devising novel similarity measures and improving runtime efficiency. 
This is because the goal of math search engines is to retrieve relevant (i.e., similar) formulae 
which correspond to a given search query that partially [211, 231, 274] or exclusively [92, 181, 
182] contains formulae. However, for a fundamental study of distributions of mathematical
expressions, neither similarity measures nor efficient lookup or indexing are required. Thus, we use
the general-purpose query language XQuery and employ the BaseX implementation. BaseX
is a free open-source XML database engine, which is fully compatible with the latest XQuery 
standard [140, 396]. Since our implementations rely on XQuery, we are able to switch to any 
other database which allows for processing via XQuery. 


3.2.2 Data Preparation 


LaTeX is the de facto standard for the preparation of academic manuscripts in the fields of
mathematics and physics [129]. Since LaTeX allows for advanced customizations and even
computations, it is challenging to process. For this reason, LaTeX expressions are unsuitable for
an extensive distribution analysis of mathematical notations. For mathematical expressions on
the web, the XML formatted MathML is the current standard, as specified by the World Wide
Web Consortium (W3C). The tree structure and the fixed standard, i.e., MathML tags, cannot be
changed, thus making this data format reliable. Several available tools are able to convert from
LaTeX to MathML [18] and various databases are able to index XML data. Thus, for this study,
we have chosen to focus on MathML. In the following, we investigate the databases arXMLiv
(08/2018) [132] and zbMATH [333].


http://basex.org/ [accessed 2019-09-01]; We used BaseX 9.2 for our experiments.
https://www.w3.org/TR/MathML3/ [accessed 2019-09-01]


The arXMLiv dataset (≈1.2 million documents) contains HTML5 versions of the documents
from the e-Print archive arXiv.org. The HTML5 documents were generated from the TeX
sources via LaTeXML [257]. LaTeXML converted all mathematical expressions into MathML with
parallel markup, i.e., presentation and content MathML. In this study we only consider the
subsets no-problem and warning, which generated no errors during the conversion process.
Nonetheless, the MathML data generated still contains some errors or falsely annotated math.
For example, we discovered several instances of affiliation and footnotes, SVG and other
unknown tags, encoded in MathML. Regarding the footnotes, we presumed that authors falsely
used mathematical environments for generating footnote or affiliation marks. We used the TeX
string, provided as an attribute in the MathML data, to filter out expressions that match the
string ‘{}^{*}’, where ‘*’ indicates any possible expression. In addition, we filtered out SVG
and other unknown tags. We assume that these expressions were generated by mistake due to
limitations of LaTeXML. The final arXiv dataset consisted of 841,008 documents which contained
at least one mathematical formula. The dataset contained a total of 294,151,288 mathematical
expressions.


In addition to arXiv, we investigated zbMATH, an international reviewing service for pure and
applied mathematics which contains abstracts and reviews of articles, hereafter uniformly called
abstracts, mainly from the domains of pure and applied mathematics. The abstracts in zbMATH
are formatted in TeX [333]. To be able to compare arXiv and zbMATH, we manually generated
MathML via LaTeXML for each mathematical formula in zbMATH and applied the same filters
as used for the arXiv documents. The zbMATH dataset contained 2,813,451 abstracts, of which
1,349,297 contained at least one formula. In total, the dataset contained 11,747,860 formulae.
Even though the total number of formulae is smaller compared to arXiv, we hypothesize that
math formulae in abstracts are particularly meaningful.


3.2.2.1 Data Wrangling 


Since we focused on the frequency distributions of visual expressions, we only considered
presentation MathML (pMML). Rather than normalizing the pMML data, e.g., via MathMLCan [117],
which would also change the tree structure and visual core elements in pMML, we only eliminated
the attributes. These attributes are used for minor visual changes, e.g., stretched parentheses
or inline limits of sums and integrals. Thus, for this first study, we preserved the core structure
of the pMML data, which might provide insightful statistics for the MathML community to further
cultivate the standard. After extracting all MathML expressions, filtering out falsely annotated
math and SVG tags, and eliminating unnecessary attributes and annotations, the datasets required
83GB of disk space for arXiv and 6GB for zbMATH, respectively.


Next, we indexed the data via BaseX. The indexed datasets required a disk space of
143.9GB in total (140GB for arXiv and 3.9GB for zbMATH). Due to the limitations of databases
in BaseX, it was necessary to split our datasets into smaller subsets. We split the datasets
according to the 20 major article categories of arXiv and classifications of zbMATH. To increase
performance, we used BaseX in a server-client environment. We experienced performance issues
in BaseX when multiple clients repeatedly requested data from the same server in short intervals.
We determined that the best workaround for this issue was to launch a separate BaseX server for
each database, i.e., each category/classification.


https://zbmath.org/ [accessed 2019-09-01]
SVG: Scalable Vector Graphics
A detailed overview of the limitations of BaseX databases can be found at
http://docs.basex.org/wiki/Statistics [accessed 2019-09-01].


Mathematical expressions often consist of multiple meaningful subexpressions, which we
defined as MOIs. However, without further investigation of the context, it is impossible to
determine meaningful subexpressions. As a consequence, every equation is a potential MOI on its
own and potentially consists of multiple other MOIs. For an extensive frequency distributional
analysis, we aim to discover all possible mathematical objects. Hence, we split every formula
into its components. Since MathML is an XML data format (essentially a tree-structured format),
we define subexpressions of equations as subtrees of its MathML format.


Listing 3.1 illustrates a Jacobi polynomial $P_n^{(\alpha,\beta)}(x)$ in pMML. The <mo> element
between </msubsup> and the following <mrow> contains the invisible times UTF-8 character
(written below as the character reference &#x2062;). By definition, the <math> element is the root
element of MathML expressions. Since we cut off all other elements besides pMML nodes, each
<math> element has one and only one child element. Thus, we define the child element of the
<math> element as the root of the expression. Starting from this root element, we explore all
subexpressions. For this study, we presume that every meaningful mathematical object (i.e., MOI)
must contain at least one identifier.


    <math><mrow>
      <msubsup>
        <mi>P</mi>
        <mi>n</mi>
        <mrow>
          <mo>(</mo>
          <mi>α</mi>
          <mo>,</mo>
          <mi>β</mi>
          <mo>)</mo>
        </mrow>
      </msubsup>
      <mo>&#x2062;</mo>
      <mrow>
        <mo>(</mo>
        <mi>x</mi>
        <mo>)</mo>
      </mrow>
    </mrow></math>

Listing 3.1: MathML representation of $P_n^{(\alpha,\beta)}(x)$.


Hence, we only study subtrees which contain at least one <mi> node. Identifiers, in the sense of
MathML, are ‘symbolic names or arbitrary text’, e.g., single Latin or Greek letters. Identifiers
do not contain special characters (other than Greek letters) or numbers. As a consequence,
arithmetic expressions, such as $(1 + 2)^2$, or sequences of special characters and numbers,
such as $\{1, 2, \ldots\} \cap \{-1\}$, will not appear in our distributional analysis. However, if a
sequence or arithmetic expression contains an identifier somewhere in the pMML tree (such as in
$\{1, 2, \ldots\} \cap A$), the entire expression will be recognized. The Jacobi polynomial
$P_n^{(\alpha,\beta)}(x)$ therefore consists of the following subexpressions: $P_n^{(\alpha,\beta)}$,
$(\alpha,\beta)$, $(x)$, and the single identifiers $P$, $n$, $\alpha$, $\beta$, and $x$. The entire
expression is also a mathematical object. Hence, we take entire expressions with an identifier into
account for our analysis. In the following, the set of subexpressions will be understood to
include the expression itself.


The arXiv categories astro-ph (astrophysics), cond-mat (condensed matter), and math (mathematics)
were still too large for a single database. Thus, we split those categories into two equally sized parts.
Sequences are always nested in an <mrow> element.
https://www.w3.org/TR/MathML3/chapter3.html [accessed 2019-09-01]


For our experiments, we also generated a string representation of the MathML data. The string
is generated recursively by applying one of two rules for each node: (i) if the current node is a
leaf, the node-tag and the content will be merged by a colon, e.g., <mi>x</mi> will be converted
to mi:x; (ii) otherwise the node-tag wraps parentheses around its content and separates the
children by a comma, e.g.,


<mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow> (3.3)


will be converted to

mrow(mo:(,mi:x,mo:)). (3.4)


Furthermore, the special UTF-8 characters for invisible times (U+2062) and function application
(U+2061) are replaced by ivt and fa, respectively. For example, the gamma function with
argument $x + 1$, $\Gamma(x + 1)$, would be represented by


mrow(mi:Γ,mo:ivt,mrow(mo:(,mrow(mi:x,mo:+,mn:1),mo:))). (3.5)


Between $\Gamma$ and $(x + 1)$, there would most likely be the special character for invisible times rather
than for function application, because LaTeXML is not able to parse $\Gamma$ as a function. Note that this
string conversion is a bijective mapping. The string representation reduces the verbose XML
format to a more concise presentation. Thus, an equivalence check between two expressions is
more efficient.
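

The two conversion rules can be sketched in a few lines. The following is a minimal illustration in Python, not the XQuery implementation used in the experiments; the function name to_string and the lookup table INVISIBLE are hypothetical, and the sketch assumes pMML without namespaces and with all attributes already removed.

```python
import xml.etree.ElementTree as ET

# Assumed short names for the two invisible operators, as described in the text.
INVISIBLE = {"\u2062": "ivt", "\u2061": "fa"}

def to_string(node):
    """Serialize a pMML node: leaves become tag:content, inner nodes tag(child,...)."""
    children = list(node)
    if not children:
        # Rule (i): merge tag and content with a colon.
        text = (node.text or "").strip()
        return f"{node.tag}:{INVISIBLE.get(text, text)}"
    # Rule (ii): wrap the comma-separated children in parentheses.
    return f"{node.tag}(" + ",".join(to_string(c) for c in children) + ")"

# The gamma function example, with invisible times between the symbol and its argument.
gamma = ET.fromstring(
    "<mrow><mi>Γ</mi><mo>\u2062</mo>"
    "<mrow><mo>(</mo><mrow><mi>x</mi><mo>+</mo><mn>1</mn></mrow><mo>)</mo></mrow></mrow>"
)
print(to_string(gamma))
```

Two expressions can then be compared by plain string equality, which is what makes the representation convenient as a key when counting frequencies.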


3.2.2.2 Complexity of Math 


Mathematical expressions can become complex and lengthy. The tree structure of MathML 
allows us to introduce a measure that reflects the complexity of mathematical expressions. 
More complex expressions usually consist of more extensively nested subtrees in the MathML 
data. Thus, we define the complexity of a mathematical expression by the maximum depth of 
the MathML tree. In XML the content of a node and its attributes are commonly interpreted as 
children of the node. Thus, we define the depth of a single node as 1 rather than 0, i.e., single 
identifiers, such as <mi>P</mi>, have a complexity of 1. The Jacobi polynomial from Listing 3.1 
has a complexity of 4. 
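

As a rough illustration of this measure, the following Python sketch computes the maximum depth of a pMML tree, counting a leaf element as depth 1. The function name complexity is hypothetical, and the input is assumed to be namespace-free pMML in which the single child of <math> is the expression root.

```python
import xml.etree.ElementTree as ET

def complexity(node):
    """Maximum depth of the subtree, where a leaf element has depth 1."""
    return 1 + max((complexity(child) for child in node), default=0)

# The Jacobi polynomial from Listing 3.1 (attributes and namespaces omitted).
jacobi = ET.fromstring(
    "<math><mrow><msubsup><mi>P</mi><mi>n</mi>"
    "<mrow><mo>(</mo><mi>α</mi><mo>,</mo><mi>β</mi><mo>)</mo></mrow></msubsup>"
    "<mo>&#x2062;</mo><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow></math>"
)
root = jacobi[0]  # the single child of <math> is the expression root

print(complexity(root))                        # depth of the Jacobi polynomial
print(complexity(ET.fromstring("<mi>P</mi>")))  # depth of a single identifier
```

The deepest path in the example runs from the outer <mrow> through <msubsup> and the inner <mrow> to one of its leaves, which gives the four levels stated above.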


We perform the extraction of subexpressions from MathML in BaseX. The algorithm for the 
extraction process is written in XQuery. The algorithm traverses recursively downwards from 
the root to the leaves. In each iteration, it checks whether there is an identifier, i.e., <mi> 
element, among the descendants of the current node. If there is no such element, the subtree 
will be ignored. It may seem counterintuitive to start from the root and check whether an identifier is
among the descendants rather than starting at each identifier and traversing upwards to the root.
However, if an XQuery expression requests a node, BaseX loads the entire subtree of the requested node
into the cache (up to a specified size). If the algorithm traverses upwards through the MathML 
tree, the XQuery will trigger database requests in every iteration. Hence, the downwards 
implementation performs better, since there is only one database request for every expression 
rather than for every subexpression. 
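

The downward traversal can be sketched as follows. This is a Python stand-in for the XQuery algorithm, written under the assumption of namespace-free pMML; the names has_identifier and subexpressions are hypothetical, and the database caching behaviour of BaseX is not modelled.

```python
import xml.etree.ElementTree as ET

def has_identifier(node):
    """True if the node is an <mi> element or has one among its descendants."""
    return node.tag == "mi" or node.find(".//mi") is not None

def subexpressions(node):
    """Yield every subtree, top-down, that contains at least one identifier."""
    if not has_identifier(node):
        return  # prune: no descendant of this node can qualify either
    yield node
    for child in node:
        yield from subexpressions(child)

# Hypothetical input: the Jacobi polynomial from Listing 3.1 without namespaces.
jacobi = ET.fromstring(
    "<math><mrow><msubsup><mi>P</mi><mi>n</mi>"
    "<mrow><mo>(</mo><mi>α</mi><mo>,</mo><mi>β</mi><mo>)</mo></mrow></msubsup>"
    "<mo>&#x2062;</mo><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow></math>"
)
root = jacobi[0]  # the single child of <math> is the expression root
found = list(subexpressions(root))
print(len(found), [node.tag for node in found])
```

For the Jacobi polynomial this yields the entire expression, three composite subexpressions, and five single identifiers, while a purely numeric expression such as 1 + 2 yields nothing.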


Since we only minimize the pMML data rather than normalizing it, two identically rendered 
expressions may have different complexities. For instance, 


<mrow><mi>x</mi></mrow> (3.6) 


consists of two distinct subexpressions, but both of them are displayed the same. Another 
problem often appears for arrays or similar visually complicated structures. The extracted 
expressions are not necessarily logical subexpressions. We will consider applying more advanced 
embedding techniques such as special tokenizers [231], symbol layout trees [92, 407], and a 
MathML normalization via MathMLCan [117] in future research to overcome these issues. 
3.2.3 Frequency Distributions of Mathematical Formulae


By splitting each formula into subexpressions, we generated longer documents and a bias 
towards low complexities. Note that, hereafter, we only refer to the mathematical content of 
documents. Thus, the length of a document refers to the number of math formulae - here the 
number of subexpressions - in the document. After splitting expressions into subexpressions, 
arXiv consists of 2.5B and zbMATH of 61M expressions, which raised the average document 
length to 2,982.87 for arXiv and 45.47 for zbMATH, respectively. 


For calculating frequency distributions, we merged two subexpressions if their string repre- 
sentations were identical. Remember, the string representation is unique for each MathML 
tree. After merging, arXiv consisted of 350,206,974 unique mathematical subexpressions with a 
maximum complexity of 218 and an average complexity of 5.01. For high complexities over 70,
the formulae show some erroneous structures that might be generated from LaTeXML by mistake.
For example, the expression with the highest complexity is a long sequence of a polynomial 
starting with ‘P,(t,, ts, t7,t,,) =’ followed by 690 summands. The complexity is caused by 
a high number of unnecessarily deeply nested <mrow> nodes. The highest complexity with a 
minimum document frequency of two is 39, which is a continued fraction. Since continued 
fractions are nested fractions, they naturally have a large complexity. One of the most complex 
expressions (complexity 20) with a minimum document frequency of three was the formula 


[Equation (3.7): an inequality with nested sums over $j_1, j_2, \ldots, j_m$, each running from 1 to $n$; the full expression is not recoverable from the extracted text.]


In contrast, zbMATH only consisted of 8,450,496 unique expressions with a maximum complexity
of 26 and an average complexity of 3.89. One of the most complex expressions in zbMATH
with a minimum document frequency of three was


$$M_p(r,f) = \left(\frac{1}{2\pi}\int_{-\pi}^{\pi} \left|f\left(re^{it}\right)\right|^{p}\,dt\right)^{1/p} \tag{3.8}$$


As we expected, reviews and abstracts in zbMATH were generally shorter and consisted of
less complex mathematical formulae. The dataset also appeared to contain fewer erroneous
expressions, since expressions of complexity 25 are still readable and meaningful.


Figure 3.2 shows the ratio of unique subexpressions for each complexity in both datasets. The 
figure illustrates that both datasets share a peak at complexity four. Compared to zbMATH, the 
arXiv expressions are slightly more evenly distributed over the different levels of complexities. 
Interestingly, complexities one and two are not dominant in either of the two datasets. Single 
identifiers only make up 0.03% in arXiv and 0.12% in zbMATH, which is comparable to expres- 
sions of complexity 19 and 14, respectively. This finding illustrates the problem of capturing 
semantic meanings for single identifiers rather than for more complex expressions [330]. It 
also substantiates that entire expressions, if too complex, are not suitable either for capturing 
the semantic meanings [214]. Instead, a middle ground is desirable, since the most unique 
expressions in both datasets have a complexity between 3 and 5. Table 3.4 summarizes the 
statistics of the examined datasets. 
Figure 3.2: Unique subexpressions for each complexity in arXiv and zbMATH.


Table 3.4: Dataset overview. Average Document Length is defined as the average number of
subexpressions per document.


Category                    arXiv            zbMATH
Documents                   841,008          1,349,297
Formulae                    294,151,288      11,747,860
Subexpressions              2,508,620,512    61,355,307
Unique Subexpressions       350,206,974      8,450,496
Average Document Length     2,982.87         45.47
Average Complexity          5.01             3.89
Maximum Complexity          218              26


3.2.3.1 Zipf’s Law 


In linguistics, it is well known that word distributions follow Zipf’s Law [297], i.e., the r-th
most frequent word has a frequency that scales to


$$f(r) \propto \frac{1}{r^{\alpha}} \tag{3.9}$$


with $\alpha \approx 1$. A better approximation can be applied by a shifted distribution


$$f(r) \propto \frac{1}{(r+\beta)^{\alpha}} \tag{3.10}$$


where $\alpha \approx 1$ and $\beta \approx 2.7$. In a study on Zipf’s law, Piantadosi [297] illustrated that not only
words in natural language corpora follow this law surprisingly accurately, but also many other
human-created sets, for instance, in programming languages, in biological systems, and even
in music. Since mathematical communication has evolved over centuries of research,
it would not be surprising if mathematical notations also followed Zipf’s law. The primary
conclusion of the law is that there are a few very common tokens against a large number
of symbols which are not used frequently. Based on this assumption, we can postulate that a
score based on frequencies might be able to measure the peculiarity of a token. The well-known
TF-IDF ranking functions and their derivatives [23, 310] have performed well in linguistics for
many years and are still widely used in retrieval systems [30]. However, since we split every
expression into its subexpressions, we generated an anomalous bias towards shorter, i.e., less
complex, formulae. Hence, distributions of subexpressions may not obey Zipf’s law.
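

To make the shifted power law concrete, the following Python sketch evaluates the shifted Zipf weight and estimates the exponent from a ranked frequency list by a least-squares fit in log-log space. The function names shifted_zipf and fit_alpha are hypothetical, the shift is treated as known rather than fitted, and the counts are synthetic; this is an illustration of the idea, not the fitting procedure behind the reported parameters.

```python
import math

def shifted_zipf(rank, alpha, beta):
    """Unnormalized shifted Zipf weight 1 / (rank + beta)^alpha."""
    return 1.0 / (rank + beta) ** alpha

def fit_alpha(frequencies, beta=0.0):
    """Least-squares slope of log f over log(rank + beta); returns the estimated alpha."""
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(r + beta) for r in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return -slope

# Synthetic counts that follow the shifted law exactly, so the fit should recover alpha.
counts = [1e6 * shifted_zipf(r, alpha=1.3, beta=15.82) for r in range(1, 1001)]
print(round(fit_alpha(counts, beta=15.82), 3))
```

On real frequency tables the points scatter around the line, so the quality of the fit, and not only the estimated exponent, indicates how well the distribution obeys the law.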


Figure 3.3: Each figure illustrates the relationship between the frequency ranks (x-axis) and the
normalized frequency (y-axis) in zbMATH (top) and arXiv (bottom). For arXiv, only the first 8
million entries are plotted to be comparable with zbMATH (≈ 8.5 million entries). Subfigure (a),
Frequency Distributions, shades the hexagonal bins from green to yellow using a logarithmic scale
according to the number of math expressions that fall into a bin. The dashed orange line represents
Zipf’s distribution (3.10). The values for α and β are provided in the plots. Subfigure (b),
Complexity Distributions, shades the bins from blue to red according to the maximum complexity
in each bin.


Figure 3.3 visualizes a comparison between Zipf’s law and the frequency distributions of
mathematical subexpressions in arXiv and zbMATH. The dashed orange line visualizes the power
law (3.10). The plots demonstrate that the distributions in both datasets obey this power law.
Interestingly, there is not much difference in the distributions between both datasets. Both
distributions seem to follow the same power law, with α = 1.3 and β = 15.82. Moreover, we can
observe that the developed complexity measure seems to be appropriate, since the complexity
distributions for formulae are similar to the distributions for the length of words [297]. In other
words, more complex formulae, as well as long words in natural languages, are generally more
specialized and thus appear less frequently throughout the corpus. Note that colors of the bins for
complexities fluctuate for rare expressions because the color represents the maximum rather
than the average complexity in each bin.


3.2.3.2 Analyzing and Comparing Frequencies 


Figure 3.4 shows in detail the most frequently used mathematical expressions in arXiv for 
the complexities 1 to 7. The orange dashed line visible in all graphs represents the normal 
Zipf’s law distribution from Equation (3.9). We explore the total frequency values without any 
normalization. Thus, Equation (3.9) was multiplied by the highest frequency for each complexity 
level to fit the distribution. The plots in Figure 3.4 demonstrate that even though the parameter
α varies between 0.35 and 0.62, the distributions in each complexity class also obey Zipf’s law.


The plots for each complexity class contain some interesting fluctuations. We can spot a set
of five single identifiers that are most frequently used throughout arXiv: n, i, x, t, and k. Even
though the distributions follow Zipf’s law accurately, we can observe that these five identifiers
are proportionally more frequently used than other identifiers and clearly separate themselves
from the rest (notice the large gap from k to a). All of the five identifiers are known to be used
in a large variety of scenarios. One might expect that common pairs of identifiers
would share comparable frequencies in the plots. Surprisingly, however, typical pairs, such as x and y,
or α and β, show a large discrepancy.


The plot of complexity two also reveals that two expressions are proportionally more often used
than others: (x) and (t). These two expressions appear more than three times as often in the
corpus as any other expression of the same complexity. On the other hand, the quantitative
difference between (x) and (t) is negligible. We may assume that arXiv’s primary domain,
physics, causes the quantitative disparity between (x), (t), and the other tokens. The primary
domain of the dataset becomes more clearly visible for higher complexities, such as $SU(2)$
(C3) or $\mathrm{km\,s}^{-1}$ (C4).


Another surprising property of arXiv is that symmetry groups, such as $SU(2)$, appear to
play an essential role in the majority of articles on arXiv, see $SU(2)$ (C3), $SU(2)_L$ (C4), and
$SU(2) \times SU(2)$ (C5), among others. The plots of higher complexities make this even more
noticeable. Given a complexity of six, for example, the most frequently used expression was
$SU(2)_L \times SU(2)_R$, and for a complexity of seven it was $SU(3) \times SU(2) \times U(1)$. Given a
complexity of eight, ten out of the top-12 expressions were from symmetry group calculations.


It is also worthwhile to compare expressions among different levels of complexities. For instance,
(x) and (t) appeared almost six million times in the corpus, but f(x) (at position three in
C3) was the only expression which contained one of these most common expressions. Note
that subexpressions of variations, such as $(x_0)$, $(t_0)$, or $(t - t')$, do not match the expression
of complexity two. This may imply that (x), and especially (t), appear in many different
scenarios. Further, we can observe that even though (x) is a part of f(x) in only approximately
3% of all cases, it is still the most likely combination. These results are especially useful for
recommendation systems that make use of math as input. Moreover, plagiarism detection
systems may also benefit from such a knowledge base. For instance, it might be evident that
f(x) is a very common expression, but for automatic systems that work on a large scale, it is
not clear whether duplicate occurrences of f(x) or E(x) should be scored differently, e.g., in
the case of plagiarism detection.


We refer to a given complexity n with Cn, i.e., C3 refers to complexity 3.
More plots showing higher complexities are available at
https://github.com/ag-gipp/FormulaCloudData [accessed 2021-10-01]


Figure 3.4: Overview of the most frequent mathematical expressions in arXiv for complexities
1-7. The color gradient from yellow to blue represents the frequency in the dataset. Zipf’s
law (3.9) is represented by a dashed orange line.


Figure 3.4 shows only the most frequently occurring expressions in arXiv. Since we already
observed a bias towards physics formulae in arXiv, it is worth comparing the expressions present
within both datasets. Figure 3.5 compares the top 25 expressions for the complexities one to
six. In zbMATH, we discovered that computer science and graph theory appeared as popular
topics, see for example $G = (V, E)$ (in C3 at position 20) and the Bachmann-Landau notations
in $O(\log n)$, $O(n^2)$, and $O(n^3)$ (C4 positions 2, 3, and 19).


From Figure 3.5, we can also deduce useful information for MathIR tasks which focus on
semantic information. Current semantic extraction tools [330] or LaTeX parsers [18] still have
difficulties distinguishing multiplications from function calls. For example, as mentioned before,
LaTeXML [257] adds an invisible times character between $f$ and $(x)$ rather than a function application.
Investigating the most frequently used terms in zbMATH in Table 3.5 reveals that $u$ is most likely
considered to be a function in the dataset: $u(t)$ (rank 8), $u(x)$ (rank 13), $u_{xx}$ (rank 16), $u(0)$ (rank
17), $|\nabla u|$ (rank 22). Manual investigations of extended lists reveal even more hits: $u_0(x)$ (rank
30), $-\Delta u$ (rank 32), and $u(x,t)$ (rank 33). Since all eight terms are among the most frequent
35 entries in zbMATH, $u$ can most likely be considered to denote a function in
zbMATH. Of course, this does not imply that $u$ must always be a function in zbMATH (see $f(u)$
on rank 14 in C3), but this allows us to exploit probabilities for improving MathIR performance.
For instance, if not stated otherwise, $u$ could be interpreted as a function by default, which
could help increase the precision of the aforementioned tools.


Figure 3.5 also demonstrates that our two datasets diverge for increasing complexities. Hence, we 
can assume that frequencies of less complex formulae are more topic-independent. Conversely, 
the more complex a math formula is, the more context-specific it is. In the following, we will 
further investigate this assumption by applying TF-IDF rankings on the distributions. 


3.2.4 Relevance Ranking for Formulae 


Zipf’s law encourages the idea of scoring the relevance of words according to their number 
of occurrences in the corpus and in the documents. The family of BM25 ranking functions 
based on TF-IDF scores are still widely used in several retrieval systems [30, 310]. Since we 
demonstrated that mathematical formulae (and their subexpressions) obey Zipf’s law in large 
scientific corpora, it appears intuitive to also use TF-IDF rankings, such as a variant of BM25, 
to calculate their relevance. 


Okapi BM25


In its original form [310], Okapi BM25 was calculated as follows


$$\mathrm{bm25}(t,d) := \frac{(k+1)\,\mathrm{IDF}(t)\,\mathrm{TF}(t,d)}{\mathrm{TF}(t,d) + k\left(1 - b + \frac{b\,|d|}{\mathrm{AVG}_{DL}}\right)} \tag{3.11}$$




Figure 3.5: The top-20 and 25 most frequent expressions in arXiv (left) and zbMATH (right) for
complexities 1-6. A line between both sets indicates a matching set. Bold lines indicate that the
matches share a similar rank (distance of 0 or 1).
Here, TF(t, d) is the term frequency of t in the document d, |d| the length of the document d
(in our case, the number of subexpressions), AVG_DL the average length of the documents in
the corpus (see Table 3.4), and IDF(t) is the inverse document frequency of t, defined as

\[
\mathrm{IDF}(t) := \log \frac{N - n(t) + \frac{1}{2}}{n(t) + \frac{1}{2}} \tag{3.12}
\]

where N is the number of documents in the corpus and n(t) the number of documents which
contain the term t. By adding 1/2, we avoid log 0 and division by 0. The parameters k and b are
free, with b controlling the influence of the normalized document length and k controlling the
influence of the term frequency on the final score. For our experiments, we chose the standard
value k = 1.2 and a high impact factor of the normalized document length via b = 0.95.
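
As a minimal sketch, Equations (3.11) and (3.12) can be written in Python as shown below. This is not the implementation used for our experiments. The function and parameter names are our own, and the smoothing constant 1/2 is assumed from the standard Okapi formulation.

```python
import math


def idf(n_docs: int, doc_freq: int) -> float:
    """Inverse document frequency of Equation (3.12).

    n_docs is N, doc_freq is n(t). The constant 0.5 is an assumption taken
    from the standard Okapi formulation.
    """
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25(tf: float, doc_len: int, avg_doc_len: float,
         n_docs: int, doc_freq: int, k: float = 1.2, b: float = 0.95) -> float:
    """Okapi BM25 of Equation (3.11) for one term in one document."""
    # Document length normalization: 1 for a document of average length.
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    return (k + 1.0) * idf(n_docs, doc_freq) * tf / (tf + k * length_norm)
```

For a document of average length and TF(t, d) = 1, the fraction reduces to IDF(t), which gives a quick sanity check for the parameters k and b.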


As a result of our subexpression extraction algorithm, we generated a bias towards low
complexities. Moreover, longer documents generally consist of more complex expressions. As
demonstrated in Section 3.2.2.1, a document that only consists of the single expression
P_n^{(α,β)}(x), i.e., a document of length one, would generate eight subexpressions, i.e., it
results in a document length of eight. Thus, we modify the BM25 score in Equation (3.11) to
emphasize higher complexities and longer documents. First, the average document length is
divided by the average complexity AVG_C in the corpus that is used (see Table 3.4), and we
calculate the reciprocal of the document length normalization to emphasize longer documents.


Moreover, in the scope of a single document, we want to emphasize expressions that do not 
appear frequently in this document, but are the most frequent among their level of complexity. 
Thus, less complex expressions are ranked more highly if the document overall is not very 
complex. To achieve this weighting, we normalize the term frequency of an expression t 
according to its complexity c(t) and introduce an inverse term frequency according to all 
expressions in the document. We define the inverse term frequency as 


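\[
\mathrm{ITF}(t, d) := \log \frac{|d| - \mathrm{TF}(t, d) + \frac{1}{2}}{\mathrm{TF}(t, d) + \frac{1}{2}} \tag{3.13}
\]

Definition of the importance score of a formula in a document

Finally, we define the score s(t, d) of a term t in a document d as

\[
s(t, d) := \frac{(k + 1)\, \mathrm{IDF}(t)\, \mathrm{ITF}(t, d)\, \mathrm{TF}(t, d)}{\max_{t' \in d|_{c(t)}} \mathrm{TF}(t', d) + k \left(1 - b + \frac{b\, \mathrm{AVG}_{DL}}{|d|\, \mathrm{AVG}_{C}}\right)} \tag{3.14}
\]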


The TF-IDF ranking functions and the introduced s(t, d) are used to retrieve relevant documents 
for a given search query. However, we want to retrieve relevant subexpressions over a set of 
documents. 
Definition of the Mathematical BM25

Thus, we define the score of a formula (mBM25) over a set of documents as the maximum
score over all documents

\[
\mathrm{mBM25}(t, D) := \max_{d \in D} s(t, d), \tag{3.15}
\]

where D is a set of documents.
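
The following sketch illustrates how s(t, d) and mBM25 could be computed for a small in-memory corpus. It is an illustration under stated assumptions rather than our Flink implementation: the data layout (one frequency counter per document), all names, the toy values, and the reading of the maximum in the denominator as "highest term frequency among expressions of the same complexity" are assumptions.

```python
import math
from collections import Counter


def log_odds(total, part):
    """Smoothed log ratio shared by IDF (3.12) and ITF (3.13); 0.5 is assumed."""
    return math.log((total - part + 0.5) / (part + 0.5))


def score(term, doc, complexity, n_docs, doc_freq, avg_doc_len, avg_complexity,
          k=1.2, b=0.95):
    """Importance score s(t, d); doc maps expressions to term frequencies."""
    tf = doc.get(term, 0)
    if tf == 0:
        return 0.0
    doc_len = sum(doc.values())  # |d|: number of subexpressions in d
    idf = log_odds(n_docs, doc_freq[term])
    itf = log_odds(doc_len, tf)
    # Highest term frequency among expressions sharing the complexity of `term`
    # (assumed interpretation of the max in the denominator).
    max_tf = max(f for t, f in doc.items() if complexity[t] == complexity[term])
    # Reciprocal length normalization, scaled by the average complexity.
    length_norm = 1.0 - b + b * avg_doc_len / (doc_len * avg_complexity)
    return (k + 1.0) * idf * itf * tf / (max_tf + k * length_norm)


def mbm25(term, docs, **stats):
    """mBM25(t, D): maximum s(t, d) over all documents d in D (D non-empty)."""
    return max(score(term, doc, **stats) for doc in docs)


# Hypothetical toy corpus; none of these numbers come from arXiv or zbMATH.
docs = [Counter({"x": 5, "n": 3, "f(x)": 2}), Counter({"n": 9, "f(x)": 1})]
stats = dict(
    complexity={"x": 1, "n": 1, "f(x)": 3},
    n_docs=10,
    doc_freq={"x": 4, "n": 5, "f(x)": 2},
    avg_doc_len=10.0,
    avg_complexity=2.0,
)
print(mbm25("f(x)", docs, **stats))
```

Taking the maximum rather than a sum keeps the score independent of how many documents in D contain the expression.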


We used Apache Flink [157] to count the expressions and process the calculations. Thus, our 
implemented system scales well for large corpora. 


Table 3.6 shows the top-7 scored expressions, where D is the entire zbMATH dataset. The
retrieved expressions can be considered meaningful, since many of them denote specific
mathematical concepts, such as Gal(Q̄/Q), which refers to the Galois group of Q̄ over Q, or
L²(R²), which refers to the L²-space (also known as Lebesgue space) over R². However, a more
topic-specific retrieval algorithm is desirable. To achieve this goal, we (i) retrieved a
topic-specific subset of documents D_q ⊂ D for a given textual search query q, and (ii)
calculated the scores of all expressions in the retrieved documents. To generate D_q, we
indexed the text sources of the documents from arXiv and zbMATH via Elasticsearch (ES)¹⁹
and performed the pre-processing steps: filtering stop words, stemming, and ASCII-folding²⁰.
Table 3.5 summarizes the settings we used to retrieve MOIs from a topic-specific subset of
documents D_q. We also set a minimum hit frequency according to the number of retrieved
documents an expression appears in. This requirement filters out uncommon notations.

Table 3.5: Settings for the retrieval experiments.

                 | arXiv | zbMATH
Retrieved Doc.   | 40    | 200
Min. Hit Freq.   | 7     | 7
Min. DF          | 50    | 10
Max. DF          | 10k   | 10k

Figure 3.6 shows the results for five search queries. We asked a domain expert from the NIST to
annotate the results as related (shown as green dots in Figure 3.6) or non-related (red dots). We
found that the results range from good performances (e.g., for the Riemann zeta function) to
bad performances (e.g., for the beta function). For instance, the results for the Riemann zeta
function are surprisingly accurate, since parts of Riemann’s hypothesis²¹ were ranked highly
throughout the results (e.g., ζ(1/2 + it)). On the other hand, for the beta function, we
retrieved only a few related hits, of which only one had a strong connection to the beta
function B(x, y). We observed that the results were quite sensitive to the chosen settings (see
Table 3.5). For instance, for the beta function, the minimum hit frequency has a strong
effect on the results, since many expressions are shared among multiple documents. For arXiv,
the expressions B(α, β) and B(x, y) only appear in one document of the retrieved 40. However,
decreasing the minimum hit frequency would increase noise in the results.


¹⁹ https://github.com/elastic/elasticsearch [accessed 2019-09-01]. We used version 7.0.0.

²⁰ This means that non-ASCII characters are replaced by their ASCII counterparts or will be ignored if no such
counterpart exists.

²¹ Riemann proposed that the real part of every non-trivial zero of the Riemann zeta function is 1/2. If this
hypothesis is correct, all the non-trivial zeros lie on the critical line consisting of the complex numbers 1/2 + it.
[Figure 3.6: five panels for the search queries ‘Beta Function’, ‘Bessel Function’, ‘Trigonometric Function’, ‘Gamma Function’, and ‘Riemann Zeta Function’, each listing ranks 1-20 for arXiv (left) and zbMATH (right).]

Figure 3.6: Top-20 ranked expressions retrieved from a topic-specific subset of documents D_q.
The search query q is given above the plots. Retrieved formulae are annotated by a domain
expert with green dots for relevant and red dots for non-relevant hits. A line is drawn if a hit
appears in both result sets. The line is colored in green when the hit was marked as relevant.
Table 3.6: Top s(t, D) scores, where D is the set of all zbMATH documents with a minimum
document frequency of 200, maximum document frequency of 500k, and a minimum complexity
of 3. Entries that are unreadable in the source are marked with […].

C3                | C4                    | C5
114.84  (n!)      | 1294    i,j=1,...,n   | 119.21  Gal(Q̄/Q)
108.85  […]       | 108.52  […]           | 112.55  […]
100.19  […]       | 108.50  ẋ = A(t)x     | 110.52  […]
100.06  […]       | 106.66  |x − x₀|      | 109.19  |f(x)|²
100.05  B(G)      | 105.52  […]           | 106.22  |∇u|² dx
99.87   […]       | 104.91  L²(R²)        | 102.86  n(n−1)/2
99.65   […]       | 103.70  ẋ = Ax + Bu   | 101.40  O(n!)

C6                            | C7
110.83  […]                   | 98.72  div(|∇u|^{p−2} ∇u)
105.69  […]                   | -
94.14   […]                   | -
92.33   (|∇u|^{p−2} ∇u)       | -
87.27   (log n / log log n)   | -
78.54   O(n log² n)           | -

Even though we asked a domain expert to annotate the results as relevant or not, there is still
plenty of room for discussion. For instance, (x + y) (rank 15 in zbMATH, ‘Beta Function’) is the
argument of the gamma function Γ(x + y) that appears in the definition of the beta function
[98, (5.12.1)] B(x, y) := Γ(x)Γ(y)/Γ(x + y). However, this relation is weak at best, and thus
might be considered as not related. Other examples are Re z and Re(s), which play a crucial
role in the scenario of the Riemann hypothesis (all non-trivial zeros have Re(s) = 1/2). Again,
this connection is not obvious, and these expressions are often used in multiple scenarios. Thus,
the domain expert did not mark the expressions as being related.


Considering the differences in the documents, it is promising to have observed a relatively high
number of shared hits in the results. Further, we were able to retrieve some surprisingly good
insights from the results, such as extracting the full definition of the Riemann zeta function
[98, (25.2.1)] ζ(s) := ∑_{n=1}^∞ 1/n^s. Even though a high number of shared hits seems to
substantiate the reliability of the system, several aspects affected the outcome negatively,
from the exact definition of the search queries used to retrieve documents via ES, to the number
of retrieved documents, the minimum hit frequency, and the parameters in mBM25.




3.2.5 Applications 


The presented results are beneficial for a variety of use-cases. In the following, we will demon- 
strate and discuss several of the applications that we propose. 


Extension of zbMATH’s Search Engine Formula search engines are often counterintuitive 
when compared to textual search, since the user must know how the system operates to enter a 
search query properly (e.g., does the system support LaTeX inputs?). Additionally, mathematical
concepts can be difficult to capture using only mathematical expressions. Consider, for example, 
someone who wants to search for mathematical expressions that are related to eigenvalues. A 
textual search query would only retrieve entire documents that require further investigation 
to find related expressions. A mathematical search engine, on the other hand, is impractical 
since it is not clear what would be a fitting search query (e.g., Av = λv?). Moreover, formula
and textual search systems for scientific corpora are separated from each other. Thus, a textual 
search engine capable of retrieving mathematical formulae can be beneficial. Also, many search 
engines allow for narrowing down relevant hits by suggesting filters based on the retrieved 
results. This technique is known as faceted search. The zbMATH search engine also provides 
faceted search, e.g., by authors, or year. Adding facets for mathematical expressions allows 
users to narrow down the results more precisely to arrive at specific documents. 


Our proposed system for extracting relevant expressions from scientific corpora via mBM25 
scores can be used to search for formulae even with textual search queries, and to add more 
filters for faceted search implementations. Table 3.7 shows two examples of such an extension 
for zbMATH’s search engine. Searching for ‘Riemann Zeta Function’ and ‘Eigenvalue’ retrieved
4,739 and 25,248 documents from zbMATH, respectively. Table 3.7 shows the most frequently 
used mathematical expressions in the set of retrieved documents. It also shows the reordered 
formulae according to a default TF-IDF score (with normalized term frequencies) and our 
proposed mBM25 score. The results can be used to add filters for faceted search, e.g., show 
only the documents which contain u ∈ W_0^{1,p}(Ω). Additionally, the search system now provides
more intuitive textual inputs even for retrieving mathematical formulae. The retrieved formulae 
are also interesting by themselves, since they provide insightful information on the retrieved 
publications. As already explored with our custom document search system in Figure 3.6, the 
Riemann hypothesis is also prominent in these retrieved documents. 
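
A facet based on mathematical expressions could be sketched as follows. The snippet assumes that each retrieved document is available as a set of its expressions and that relevance scores have been computed beforehand. The function names, document ids, and scores are hypothetical and do not reflect zbMATH’s actual interface.

```python
def suggest_facets(result_docs, scores, top_k=5):
    """Offer the top-k scored expressions of the result set as filter options."""
    present = set().union(*result_docs.values())
    return sorted(present, key=lambda expr: scores[expr], reverse=True)[:top_k]


def apply_facet(result_docs, expression):
    """Ids of the retrieved documents that contain the selected expression."""
    return sorted(doc_id for doc_id, expressions in result_docs.items()
                  if expression in expressions)


# Hypothetical result set for a textual query.
result_docs = {"d1": {"zeta(s)", "s"}, "d2": {"zeta(s)"}, "d3": {"s"}}
scores = {"zeta(s)": 3.0, "s": 1.0}
print(suggest_facets(result_docs, scores, top_k=1))
print(apply_facet(result_docs, "zeta(s)"))
```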


The differences between the TF-IDF and mBM25 rankings illustrate the difficulty of an extensive
evaluation of our system. From a broader perspective, the hit Ax = λBx is highly correlated
with the input query ‘Eigenvalue’. On the other hand, the raw frequencies revealed a prominent
role of div(|∇u|^{p−2} ∇u). Therefore, the top results of the mBM25 ranking can also be considered
as relevant.


Math Notation Analysis A faceted search system allows us to analyze mathematical nota- 
tions in more detail. For instance, we can retrieve documents from a specific time period. This 
allows one to study the evolution of mathematical notation over time [54], or to identify
trends in specific fields. Also, we can analyze standard notations for specific authors since it is 
often assumed that authors prefer a specific notation style which may vary from the standard 
notation in a field. 
Table 3.7: The top-5 frequent mathematical expressions in the result set of zbMATH for the
search queries ‘Riemann Zeta Function’ (top) and ‘Eigenvalue’ (bottom) grouped by their
complexities (left) and the hits reordered according to their relevance scores (right). The
TF-IDF score was calculated with normalized term frequencies. Entries that are unreadable in
the source are marked with […]; frequencies are given as extracted.

Riemann Zeta Function

C1          | C2          | C3               | C4
15,051  n   | 4,663  (s)  | 1,456  ζ(s)      | 39   (½ + it)
11,709  s   | 2,460  […]  | 340    σ + it    | 232  (1/2 + it)
9,768   a   | 2,163  (n)  | 310    […]       | […]
8,913   k   | 1,485  (t)  | 275    (log T)   | 136  […]
8,634   T   | 145    it   | 264    1/2 + it  | 97   s = σ + it

C5                   | C6                    | TF-IDF        | mBM25
203  ζ(½ + it)       | 105  |ζ(1/2 + it)|    | ζ(s)          | ζ(1/2 + it)
166  ζ(1/2 + it)     | 88   […]              | ζ(1/2 + it)   | (1/2 + it)
124  ζ(σ + it)       | 81   |ζ(σ + it)|      | (1/2 + it)    | […]
54   ζ(1 + it)       | 32   |ζ(1 + it)|      | […]           | ζ(½ + it)
44   ζ(2n + 1)       | 22   […]              | (½ + it)      | (σ + it)

Eigenvalue

C1          | C2           | C3            | C4
45,488  n   | 12,515  (x)  | 68    −Δu     | 218  […]
43,090  x   | 6,598   (t)  | 55    (n−1)   | 218  […]
37,434  […] | 437     […]  | 521   |∇u|    | 133  […]
35,302  u   | 2,787   (Ω)  | 512   […]     | 127  |∇u|²
22,460  t   | 2,725   Rⁿ   | 4955  […]     | 97   (a_ij)

C5                       | C6                       | TF-IDF    | mBM25
139  |∇u|^{p−2} ∇u       | 137  (|∇u|^{p−2} ∇u)     | Ax = λBx  | div(|∇u|^{p−2} ∇u)
68   −d²/dx²             | 35   −(py′)′             | […]       | […]
51   A = (a_ij)          | 26   […]                 | P(λ)      | […]
46   […]                 | 18   […]                 | […]       | […]
45   u ∈ W_0^{1,p}(Ω)    | 18   ∫_Ω |∇u|² dx        | […]       | […]

Table 3.8: Suggestions to complete ‘E = m’ and ‘E = {m, c}’ (the right-hand side contains m
and c) with term and document frequency based on the distributions of formulae in arXiv.
Entries that are unreadable in the source are marked with […].

Auto-completion for ‘E = m’          | Suggestions for ‘E = {m, c}’
Sug. Expression        TF    DF      | Sug. Expression   TF    DF
E = mc²                558   376     | E = mc²           558   376
E = m cosh θ           23    23      | E = γmc²          39    38
[…]                    7     7       | […]               41    36
E = m/√(1 − […])       12    6       | E = m cosh θ      23    23
E = m/√(1 − […])       10    6       | E = −mc²          35    17
[…]                    6     6       | […]               10    8


Math Recommendation Systems The frequency distributions of formulae can be used to
realize effective math recommendation tasks, such as type hinting or error-corrections. Existing
approaches require long training on large datasets, but may still generate meaningless results,
such as G_i = {(x, y) ∈ Rⁿ : x_i = x_i} [400]. We propose a simpler system which takes
advantage of our frequency distributions. We retrieve entries from our result database, which
contains all unique expressions and their frequencies. We implemented a simple prototype that
retrieves the entries via pattern matching. Table 3.8 shows two examples. The left side of
the table shows suggested autocompleted expressions for the query ‘E = m’. The right side
shows suggestions for ‘E =’, where the right-hand side of the equation should contain m and
c in any order. A combination with more advanced retrieval techniques, such as similarity
measures based on symbol layout trees [92, 407], would enlarge the number of suggestions.
This kind of autocomplete and error-correction type-hinting system would be beneficial for
various use-cases, e.g., in educational software or for search engines as a pre-processing step of
the input.
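
The two kinds of suggestions can be illustrated with a small sketch. It is not our prototype: the string encoding of the expressions, the function names, and the ranking by document frequency are assumptions. The four entries reuse the frequencies listed in Table 3.8.

```python
# (expression, term frequency, document frequency); string encoding is hypothetical.
ENTRIES = [
    ("E=mc^2", 558, 376),
    ("E=\\gamma mc^2", 39, 38),
    ("E=m\\cosh\\theta", 23, 23),
    ("E=-mc^2", 35, 17),
]


def by_document_frequency(entries):
    """Order suggestions by document frequency, then by term frequency."""
    ranked = sorted(entries, key=lambda e: (e[2], e[1]), reverse=True)
    return [expr for expr, _, _ in ranked]


def autocomplete(prefix):
    """Suggest expressions that start with the typed prefix."""
    return by_document_frequency([e for e in ENTRIES if e[0].startswith(prefix)])


def suggest_containing(lhs, symbols):
    """Suggest equations 'lhs=...' whose right-hand side contains all symbols."""
    hits = []
    for entry in ENTRIES:
        left, _, right = entry[0].partition("=")
        if left == lhs and all(symbol in right for symbol in symbols):
            hits.append(entry)
    return by_document_frequency(hits)


print(autocomplete("E=m"))
print(suggest_containing("E", ["m", "c"]))
```

Note that naive substring matching also accepts the letter c inside \cosh, so a practical matcher would have to compare tokens instead of characters.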


Plagiarism Detection Systems As previously mentioned, plagiarism detection systems 
would benefit from a system capable of distinguishing conventional from uncommon nota- 
tions [253, 254, 334]. The approaches described by Meuschke et al. [254] outperform existing 
approaches by considering frequency distributions of single identifiers (expressions of com- 
plexity one). Considering that single identifiers make up only 0.03% of all unique expressions 
in arXiv, we presume that better performance can be achieved by considering more complex ex- 
pressions. The conferred string representation also provides a simple format to embed complex 
expressions in existing learning algorithms. 


Expressions with high complexities that are shared among multiple documents may provide
further hints to investigate potential plagiarism. For instance, the most complex expression
that was shared among three documents in arXiv was Equation (3.7). A complex expression
being identical in multiple documents could indicate a higher likelihood of plagiarism. Further
investigation revealed that similar expressions, e.g., with infinite sums, are frequently used
among a larger set of documents. Thus, the expression seems to be part of a standard notation
that is commonly shared, rather than a good candidate for plagiarism detection. Through
manual investigation, we identified the equation as part of a concept called generalized
Hardy-Littlewood inequality; Equation (3.7) appears in the three documents [24, 292, 304]. All
three documents shared one author in common. Thus, this case also demonstrates a correlation
between complex mathematical notations and authorship.

[Figure 3.7: ranked expressions with their mBM25 scores for the search query ‘Jacobi Polynomial’, arXiv (left) and zbMATH (right).]

Figure 3.7: The top ranked expressions for ‘Jacobi polynomial’ in arXiv and zbMATH. For arXiv,
30 documents were retrieved with a minimum hit frequency of 7.
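
The check for complex expressions shared among few documents, as discussed for plagiarism detection, could be sketched as follows. The thresholds, names, and toy data are hypothetical, and a hit only marks a candidate for manual inspection.

```python
from collections import defaultdict


def shared_complex_expressions(docs, complexity, min_complexity, max_docs):
    """Map each sufficiently complex expression to the documents sharing it.

    Only expressions contained in at least two and at most max_docs documents
    are reported, since widely shared expressions are likely standard notation.
    """
    containing = defaultdict(set)
    for doc_id, expressions in docs.items():
        for expr in expressions:
            if complexity[expr] >= min_complexity:
                containing[expr].add(doc_id)
    return {expr: sorted(ids) for expr, ids in containing.items()
            if 2 <= len(ids) <= max_docs}


# Hypothetical toy data; 'long_sum' stands for a deeply nested expression.
docs = {"a": {"x", "long_sum"}, "b": {"x", "long_sum"}, "c": {"x"}}
complexity = {"x": 1, "long_sum": 12}
print(shared_complex_expressions(docs, complexity, min_complexity=10, max_docs=3))
```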


Semantic Taggers and Extraction Systems We previously mentioned that semantic extraction
systems [214, 329, 330] and semantic math taggers [71, 402] have difficulties in extracting
the essential components (MOIs) from complex expressions. Considering the definition of the
Jacobi polynomial in Equation (3.2), it would be beneficial to extract the groups of tokens that
belong together, such as P_n^{(α,β)}(x) or Γ(α + m + 1). With our proposed search engine for
retrieving MOIs, we are able to facilitate semantic extraction systems and semantic math
taggers. Imagine such a system being capable of identifying the term ‘Jacobi polynomial’ from the
textual context. Figure 3.7 shows the top relevant hits for the search query ‘Jacobi polynomial’
retrieved from zbMATH and arXiv. The results contain several relevant and related expressions,
such as the constraints α, β > −1 and the weight function for the Jacobi polynomial
(1 − x)^α (1 + x)^β, which are essential properties of this orthogonal polynomial. Based on
these retrieved MOIs, an extraction system can adjust its retrieved math elements to improve
precision, and semantic taggers or a tokenizer could re-organize parse trees to more closely
resemble expression trees.
3.2.6 Outlook


In this first study, we preserved the core structure of the MathML data which provided insightful 
information for the MathML community. However, this makes it difficult to properly merge 
formulae. In future studies, we will normalize the MathML data via MathMLCan [117]. In 
addition to this normalization, we will include wildcards for investigating distributions of 
formula patterns rather than exact expressions. This will allow us to study connections between 
math objects, e.g., between Γ(z) and Γ(z + 1). This would further improve our recommendation
system and would allow for the identification of regions for parameters and variables in complex 
expressions. 


3.3 Semantification with Textual Context Analysis 


The results of our math embedding experiments and the introduction of MOI motivate us to
develop a context-sensitive LaTeX to CAS translation approach around the MOI concept. In this
section, we briefly discuss our novel approach to perform context-sensitive translations from
LaTeX to CAS, which concludes research task II. We focus on three main sources of semantic
information to disambiguate mathematical expressions sufficiently for such translations:


1. the inclusive structural information in the expression itself; 
2. the textual context surrounding the expression; and 
3. a common knowledge database.


The first source is what most existing translators rely on by concluding the semantics from a 
given structure. The second source is rather broad. The necessary information can be given 
in the sentences before and after an equation, somewhere in the same article, or even through 
references (e.g., hyperlinks in Wikipedia articles or citations in scientific publications). In 
this thesis, we will focus on the textual context in a single document, i.e., we do not analyze 
references or deep links to other articles yet. The last source can be considered a backup option. 
If we cannot retrieve information from the context of a formula, the semantic meaning of a 
formula might be considered common knowledge, such as π referring to the mathematical
constant. 


We extract knowledge from each of the three sources with different approaches. For the inclusive 
structural information, we rely on the semantic LaTeX macros developed by Miller [260] for
the DLMF that define standard notation patterns for numerous OPSF. To analyze the textual 
context of a formula, we rely on the approach proposed by Schubotz et al. [330], who extracted 
noun phrases to enrich identifiers semantically. As a backup common knowledge database, we 
use the POM tagger developed by Youssef [402] that relies on manually crafted lexicon files 
with several common knowledge annotations for mathematical tokens. 


3.3.1 Semantification, Translation & Evaluation Pipeline 


Figure 3.8 illustrates the pipeline of the proposed system to convert generic LaTeX expressions
to CAS. The figure contains numbered badges that represent the different steps in the system. 
Steps 1-4 represent the conversion pipeline, while steps 5-7 are different ways to evaluate the 
system. 
Figure 3.8: Pipeline of the proposed context-sensitive conversion process. The pipeline consists
of four semantification steps (1-4) and three evaluation approaches (5-7). 


The conversion pipeline starts with mathosphere²² (step 1a). Mathosphere is the Java framework
developed by Schubotz et al. [279, 329, 330] in a sequence of publications to semantically
enrich mathematical identifiers with defining phrases from the textual context. First, we will
modify mathosphere so that it extracts MOI-definiens pairs rather than single identifiers (step
1b). For this purpose, we propose the following significant simplification: an isolated
mathematical expression in a textual context is considered essential and informative. Hence,
isolated formulae are defined as MOI. Moreover, mathosphere scores identifier-definiens pairs
with regard to their first appearance in a document (since the first declaration of a symbol
often remains valid throughout the rest of the document [394]). We adopt this scoring for MOI
with a matching algorithm that allows us to identify MOI within other MOI in the same document
(step 1c).


Step 2 is currently optional and combines the results from the MOI-definiens extraction
process with the common knowledge database of the POM tagger. The information can then
be used to feed existing LaTeX to MathML converters with additional semantic information. In
Chapter 2, we created a MathML benchmark, called MathMLben, to evaluate such converters.
We have also shown that, for example, LaTeXML can adopt additional semantic information via
given semantic macros. Hence, via step 3 (and subsequently step 5), we can evaluate our
semantification so far with the help of existing converters. Steps 2, 3, and 5 are not
subject of this thesis but part of upcoming projects.


²² https://github.com/ag-gipp/mathosphere [accessed 2020-03-24]
Besides this optional evaluation over MathMLben, we continue our main translation path.
Once we have extracted the MOI-definiens pairs, we replace the generic LaTeX expressions by
their semantic counterparts (step 4). We do so by indexing semantic LaTeX macros so that we can
search for them by textual queries. Afterward, we are able to retrieve semantic LaTeX macros by
the previously extracted definiens. Finally, we create replacement patterns so that the generic
LaTeX expression can be replaced with the semantically enriched macros from the DLMF.
The result should be semantic LaTeX, which enables another evaluation method. If we
perform this pipeline on the DLMF, we can compare the generated semantic LaTeX with the
original, manually crafted semantic LaTeX source in the DLMF to validate its correctness (step
6). Unfortunately, the entire pipeline focuses on the textual context, and the DLMF does not
provide sophisticated textual information because semantic information is available via special
infoboxes, through hyperlinks, or in tables and graphs. A more comprehensive evaluation
approach can be enabled by further translating the expressions to the syntax of CAS via LaCASt
(step 7), which allows the evaluations we have shown in previous projects [2], namely symbolic
and numeric evaluations. Moreover, this evaluation is most desired since it evaluates the entire
proposed translation pipeline, from the semantification via mathosphere and the semantic LaTeX
macros to the final translation via LaCASt. The next chapter will aim to realize this proposed
pipeline. Steps 1 and 4 are discussed in Chapter 4. Step 7 is subject of Chapter 5. Step 6 has
not been realized due to the reduced amount of textual context within the DLMF. Steps 2, 3, and
5 are subject of future work.


This Chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/).
This chapter addresses research tasks III and IV, i.e., implementing a system for automated
semantification and translation of mathematical expressions to CAS syntax. In the previous
chapter, we laid the foundation for a novel context-sensitive semantification approach that
extracts the semantic information from a textual context and semantically enriches a formula
with semantic LaTeX macros. In this chapter, we realize this proposed semantification approach
on 104 English Wikipedia articles with 6,337 mathematical expressions. However, before we
continue with this main track, we first apply a novel context-agnostic machine translation
approach for translations from LaTeX to Mathematica.


Previously, we found that rule-based translators are rather limited, mostly because
the rules are carefully selected and manually crafted. This manual approach makes it difficult
to estimate the level of semantics that can be concluded directly from an expression (due to its
structure, notation style, or the included symbols). Finding patterns in large data is a classic
task for ML solutions. Hence, we will first examine the effectiveness of a machine translation
approach in Section 4.1. We will see that the machine translation approach is very effective in
adopting the notation style generated by Mathematica’s LaTeX exports but fails to generalize the
trained patterns to real-world scenarios or other libraries. A qualitative evaluation of the same
model on the DLMF underlines the inappropriateness of the approach for a general translator.
Nonetheless, the model still outperforms Mathematica’s internal LaTeX import function.


Supplementary Information The online version contains supplementary material available at
https://doi.org/10.1007/978-3-658-40473-4_4.


© The Author(s) 2023
A. Greiner-Petter, Making Presentation Math Computable,
https://doi.org/10.1007/978-3-658-40473-4_4


The machine translation approach presented in Section 4.1 partially contains excerpts of our¹
upcoming submission to the ACL Conference 2023. Section 4.2 has been accepted for
publication in the upcoming issue of the IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI) [11] journal. In order to provide a coherent story line, Section 4.2 only
presents the first half of the TPAMI submission. The second half, the evaluation and discussion
sections, subsequently continues in Chapter 5.


4.1 Context-Agnostic Neural Machine Translation 


Mathematical formulae are generally longer than natural language sentences. 98% of
the sentences in the Stanford Natural Language Inference (SNLI) entailment task, for example,
contain fewer than 25 words [48]. In contrast, the average number of Mathematica tokens in the
Mathematical Functions Site (MFS) dataset is 173. Short and long expressions are relatively
rare but have a wider range compared to natural language sentences, e.g., 2.25% contain fewer
than 25 tokens and 2.1% contain more than 1,024 tokens. Meanwhile, the vocabulary of such a
mathematical language only contains 1k tokens compared to 60k tokens for a news classification
model [410]².


The most common neural machine translation models are sequence-to-sequence recurrent 
neural networks [355], tree-structured recursive neural networks [136], transformer sequence- 
to-sequence networks [371], and convolutional sequence-to-sequence networks [130]. For 
natural language translation tasks, transformer networks are known to outperform the
others [130, 277, 371]. In this section, we use convolutional sequence-to-sequence networks [130]
since they perform better on our mathematical language. Regarding related work, only a few
approaches for mathematical language translation exist [95, 219, 275, 296, 373, 375, 376, 379].


4.1.1 Training Datasets & Preprocessing 


We used two datasets for our experiments: the Mathematical Functions Site (MFS)³ and parts
of the DLMF. For the MFS, we fetched all formulae in Mathematica’s InputForm⁴ and
exported every expression with Mathematica’s internal TeXForm⁵ export function. This process
generated 307,409 expression pairs in LaTeX and Mathematica notation. We did the same for
the DLMF dataset but used LaTeXML for the conversion from semantic LaTeX to LaTeX. From the
DLMF, we generated 11,605 pairs in LaTeX and semantic LaTeX notation. Note that LaTeXML and
Mathematica’s TeXForm are rule-based translators. Hence, the generated data is limited to the
abilities of the methods we used. Finally, we parsed the data into binary trees in postfix notation
with the help of a custom rule-based tokenizer for LaTeX and Mathematica expressions.


¹ The translator is a project by Felix Petersen that was supervised by Moritz Schubotz and
assisted by me. In particular, I evaluated the model on the DLMF dataset and helped to devise
the final paper for publication. The section has been mostly rewritten and shortened to the
main findings to avoid conflicts.

² More recent discussions argue, however, that such large vocabularies are often not required
and can be significantly reduced in size without a dramatic decrease in the model's
performance [70].

³ http://functions.wolfram.com/ [accessed 2021-09-20]

⁴ https://reference.wolfram.com/language/ref/InputForm.html [accessed 2021-09-20]

⁵ https://reference.wolfram.com/language/ref/TeXForm.html [accessed 2021-09-20]


4.1.2 Methodology


Besides our final convolutional sequence-to-sequence model [130], we also experimented with
Long Short-Term Memory (LSTM) recurrent networks [369], recurrent, recursive, and
transformer neural networks [130, 277, 371], and LightConv [397] as an alternative to the
classic convolutional sequence-to-sequence models [130]. However, our final model outperformed
all other approaches. In the following, we list the hyperparameters and additional design
choices that performed best in our experiments.


We use 


Learning Rate, Gradient Clipping, Dropout, and Loss: a learning rate of 0.25, gradient
clipping for gradients greater than 0.1, a dropout rate of 0.2, and a label-smoothed
cross-entropy loss;


State/Embedding Size(s): a single state size of 512 tokens; 


Number of Layers: 11 layers; 


Batch Size: 48 000 tokens per batch (which is equivalent to a batch size of about 400 
formulae); and 


Kernel Size: 3. 


Since the MFS dataset contains more than 10^4 multi-digit numbers (in contrast to fewer than
10^3 non-numerical tags), these numbers cannot be interpreted as conventional tags. Thus,
numbers are either split into single digits or replaced by variable tags. Splitting numbers
into single digits causes significantly longer token streams, which degrades performance.
Substituting all multi-digit numbers with tags like <number_01> improved the exact match
accuracy on the validation dataset from 92.7% to 95.0%. We use a total of 32 such placeholder
tags since more than 99% of the formulae contain at most 32 multi-digit numbers. We randomly
select the tags with which we substitute the numbers. Since multi-digit numbers almost always
correspond exactly across the different mathematical languages, we directly replace the tags
with their corresponding numbers after the translation. Lastly, we split the MFS dataset into
97% training data, 0.5% validation data, and 2.5% test data, and split the smaller semantic
LaTeX dataset into 90% training data, 5% validation data, and 5% test data.
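
The placeholder substitution can be illustrated with a minimal sketch. It is not the original preprocessing code: the regular expression for multi-digit numbers, the function names, and the choice to map repeated occurrences of the same number to the same tag are assumptions made for this example. Only the tag format <number_01> and the limit of 32 tags are taken from the text.

```python
import random
import re

NUM_TAGS = 32  # number of placeholder tags reported in the text
MULTI_DIGIT = re.compile(r"\d{2,}")  # assumption: a run of two or more digits


def mask_numbers(formula, rng):
    """Replace every multi-digit number by a randomly chosen placeholder tag."""
    numbers = list(dict.fromkeys(MULTI_DIGIT.findall(formula)))  # distinct, in order
    if len(numbers) > NUM_TAGS:
        raise ValueError("formula has more multi-digit numbers than tags")
    tag_ids = rng.sample(range(1, NUM_TAGS + 1), len(numbers))
    number_to_tag = {num: f"<number_{i:02d}>" for num, i in zip(numbers, tag_ids)}
    masked = MULTI_DIGIT.sub(lambda m: number_to_tag[m.group()], formula)
    tag_to_number = {tag: num for num, tag in number_to_tag.items()}
    return masked, tag_to_number


def unmask_numbers(translation, tag_to_number):
    """Substitute the original numbers back after the translation."""
    for tag, number in tag_to_number.items():
        translation = translation.replace(tag, number)
    return translation


source = r"\frac{1024}{x^{12}} + 1024"
masked, mapping = mask_numbers(source, random.Random(0))
restored = unmask_numbers(masked, mapping)
```

In this sketch, the translation model would only ever see the masked string, and the mapping is applied to the model's output to restore the numbers.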


4.1.3 Evaluation of the Convolutional Network 


In the following, we use three evaluation metrics: Exact Matches (EM), Levenshtein distance
[227], and Bilingual Evaluation Understudy (BLEU) [282]. The EM and the Levenshtein
distance are calculated by comparing the sequences of Mathematica and TeX tokens.
Hence, even equivalent LaTeX expressions, such as E=mc^2 and E=mc^{2}, are not considered
an EM. Due to the two additional curly brackets, the Levenshtein distance between both
expressions is 2. We further denote the share of translations that have a Levenshtein distance
of up to 5 by LD≤5 and the average Levenshtein distance by LD.
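
The following sketch illustrates how EM and the Levenshtein distance behave on token sequences. The tokenizer is a strong simplification and an assumption of this example (a token is either a LaTeX command or a single non-space character); the custom rule-based tokenizer used for the experiments is not reproduced here.

```python
import re

# Assumption: a token is a LaTeX command or any single non-space character.
TOKEN = re.compile(r"\\[A-Za-z]+|\S")


def tokenize(latex):
    return TOKEN.findall(latex)


def levenshtein(a, b):
    """Edit distance between two token sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def exact_match(prediction, reference):
    return tokenize(prediction) == tokenize(reference)


def share_within(distances, limit=5):
    """Share of translations with a distance of at most `limit` (LD≤5 for limit=5)."""
    return sum(d <= limit for d in distances) / len(distances)


a = tokenize("E=mc^2")
b = tokenize("E=mc^{2}")
distance = levenshtein(a, b)  # the two extra curly brackets cost two insertions
```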


The BLEU score is a quality measure that compares the machine's output to a translation by
a professional human translator. It compares the n-grams (specifically n = 1, ..., 4) between
the prediction and the ground truth. Since the reference translations in our datasets are
ground truth values rather than human translations, this metric reflects the closeness to the
ground truth for the back-translation of formulae. BLEU scores range from 0 to 100, with
higher values indicating better results. For comparison with natural languages,
state-of-the-art translators reach a BLEU score of 35.0 from English to German and 45.6 from
English to French [102]. The significantly higher BLEU scores for formula translations can be
attributed to the larger vocabularies in natural language and a considerably higher
variability between correct translations.
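
To make the n-gram comparison concrete, the following sketch computes an unsmoothed, sentence-level BLEU score on a 0 to 100 scale. It is only an illustration under simplifying assumptions (one reference per prediction, no smoothing, tokens given as lists); it is not the evaluation script that produced the scores reported in this section, which are usually computed on corpus level.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(prediction, reference, max_n=4):
    """Sentence-level BLEU (0-100) with clipped n-gram precision, no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        pred, ref = ngrams(prediction, n), ngrams(reference, n)
        overlap = sum(min(count, ref[gram]) for gram, count in pred.items())
        total = sum(pred.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty punishes predictions shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(reference) / len(prediction)))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)


reference = "E = m c ^ 2".split()
prediction = "E = m c ^ { 2 }".split()
partial_score = bleu(prediction, reference)
perfect_score = bleu(reference, reference)
```

A prediction that is token-wise identical to the ground truth reaches the maximum score, while the two additional brackets in the example already lower all four n-gram precisions.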


In addition, we perform round trip experiments from LaTeX into Mathematica and back again on
the im2latex-100k⁶ dataset [95]. This dataset consists of about 100k formulae from arXiv
papers, including their renderings. The im2latex-100k task was originally conceived for the
conversion of mathematical formulae from images into LaTeX via OCR. We instead use it as an
additional source of more general mathematical expressions. For our round trip experiment, we
translate all LaTeX expressions into Mathematica with both the internal LaTeX import function
and our convolutional sequence-to-sequence model. Afterward, we use Mathematica's export
function to generate LaTeX again. Finally, we compare this round-trip-translated LaTeX with
the original input formula. Note that 66.8% of the equations in the im2latex-100k dataset
contain tokens that are not in our model's vocabulary.


4.1.3.1 Results 


Table 4.1 shows the results of our convolutional sequence-to-sequence model for translations
to Mathematica and semantic LaTeX evaluated with the EM rate and the BLEU score. We achieved
an EM accuracy of 95.1% and a BLEU score of 99.68 for translations to Mathematica. For
the translation from LaTeX to semantic LaTeX, we achieved an EM accuracy of 90.7% and a
BLEU score of 96.79. In Table 4.2, we compare our model with Mathematica's internal TeX
import function on the two datasets MFS and im2latex-100k. While the accuracy drops on a new
dataset, our model still outperforms Mathematica's import function on all metrics. Lastly, for
a more qualitative analysis, we manually evaluated our model on 100 random samples of DLMF
formulae, i.e., we did not compute the EM or BLEU score; instead, a human annotator checked
whether a translation was correct or at least syntactically valid (which is the same as the
previously used Import metric). All 100 samples and the results are listed in Table E.1 in
Appendix E.1, available in the electronic supplementary material. Table 4.3 shows the
comparison of our model with Mathematica's import function and our previously developed
translator LaCASt [13]. On these random samples, Mathematica outperforms our model, but
LaCASt performs best. Note, however, that LaCASt was specifically designed for translations
on the DLMF, which allows LaCASt to correctly anticipate the usage of constants, such as i
for the imaginary unit or e for Euler's number.


Table 4.1: Results for the backward translation.


Metric    LaTeX → Mathematica    LaTeX → semantic LaTeX
EM        95.1%                  90.7%
BLEU      99.68                  96.79


⁶ https://paperswithcode.com/dataset/im2latex-100k [accessed 2021-09-21]


Table 4.2: Comparison between Mathematica and our model on backward translation of the
formulae of the MFS and im2latex-100k datasets. Import denotes the fraction of formulae that
can be imported by Mathematica, i.e., the translation was syntactically valid.


Dataset          Method           EM       Import    LD≤5     LD
MFS              Mathematica      2.7%     88.5%     16.4%    88.7
MFS              Conv. Seq2Seq    95.1%    98.38%    96.7%    0.615
im2latex-100k    Mathematica      15.3%    0.153%    2.30%    18.3
im2latex-100k    Conv. Seq2Seq    16.3%    0.698%    2.56%    12.9


Table 4.3: Qualitative comparison between Mathematica, LaCASt, and our model on 100 random
DLMF samples. ✗ indicates wrong translations and ✓ indicates correct translations. As in
Table 4.2, Import denotes syntactically valid translations. The full dataset is listed in
Appendix E.1, available in the electronic supplementary material.


Method           Import    ✓      ✗
Mathematica      71%       11%    89%
LaCASt           57%       22%    78%
Conv. Seq2Seq    45%       5%     95%


4.1.3.2 Qualitative Analysis and Discussion 


We observe that our model outperforms Mathematica in various scenarios. A good example of
this is the following equation⁷:


$$ \wp(z; g_2, g_3) = -\frac{\sigma(z - z_0; g_2, g_3)\,\sigma(z + z_0; g_2, g_3)}{\sigma(z; g_2, g_3)^2\,\sigma(z_0; g_2, g_3)^2} \quad\text{with}\quad z_0 = \wp^{-1}(0; g_2, g_3). \tag{4.1} $$


The symbol ℘ (\wp) is properly interpreted by both the model and Mathematica as the
Weierstrass elliptic function ℘ (WeierstrassP) because the symbol ℘ is uniquely tied to that
function. The inverse of this function, ℘⁻¹, is also properly interpreted by both systems as
InverseWeierstrassP. However, Mathematica did not interpret σ as WeierstrassSigma,
presumably due to the ambiguity of σ. Since the expression is from the MFS and ℘ appears in
the same expression, we can conclude that σ refers to WeierstrassSigma. Our model was able
to capture this connection and correctly translated the entire expression.


The low scores of Mathematica on its own dataset can be attributed to the fact that
Mathematica does not attempt to disambiguate its own exported expressions. As discussed
earlier, an export from a computational language to a presentational language loses semantic
information. Our sequence-to-sequence model was able to restore the semantic information
under the assumption that the input was generated from the MFS via Mathematica. Hence, our
model performs very well on the training data but is unable to produce reliable translations on


⁷ Extracted from https://functions.wolfram.com/EllipticFunctions/WeierstrassP/introductions/Weierstrass/04/
[accessed 2021-09-14]


Table 4.4: Examples of our machine translations from LaTeX to Mathematica for the same
expressions as in Table 1.2 from the introduction. The column MM refers to the results of
Mathematica's import function from Table 1.2 for comparison. Circled results (⊗) are
syntactically invalid.


LaTeX                       | Conv. Seq2Seq Translation    | MM
\int_a^b x dx               | Integrate[x^(x,x)  ⊗         | ✗
\int_a^b x \mathrm{d}x      | Integrate[x^x,a]  ✗          | ✗
\int_a^b x\, dx             | Integrate[x^bx,x]  ✗         | ✓
\int_a^b x\; dx             | Integrate[x^bx,x]  ✗         | ✗
\int_a^b x\, \mathrm{d}x    | Integrate[x^a^bx,x]  ✗       | ✗
\int_a^b \frac{dx}{x}       | Integrate[(dx)/x  ⊗          | ✗
\sum_{n=0}^N n^2            | Sum[n^2,{n,0,N}]  ✓          | ✓
\sum_{n=0}^N n^2 + n        | Sum[n^2+n,{n,0,N}]+n  ✗      | ?
{n \choose m}               | JacobiSymbol[n,m]  ✗         | ✗
\binom{n}{m}                | Binomial[n,m]  ✓             | ✓
\int_a^b x dt               | Integrate[x,{t,a,b}]  ✓      |
\int_a^b x^2 dx             | Integrate[x^2,{x,a,b}]  ✓    |
\int_{a}^{b} x^2 dx         | Integrate[x^2,{a,a,b}]  ✗    |


unseen, more general expressions. A first hint of this problem can be found in Table 4.3 for
our evaluation on the 100 DLMF formulae. While our model clearly outperforms Mathematica
on the MFS dataset, the internal rule-based import function of Mathematica works more
reliably on unknown expressions. One reason for the low performance of our model on the
DLMF evaluation is our vocabulary: 71 of the 100 expressions contain tokens that are not in
the Mathematica-export vocabulary. Hence, our model was unable to correctly interpret these
expressions. This clearly underlines the limitation of the model. To mitigate this effect in
the future, we could use multilingual translations [40, 174], which would allow learning
translations and tokens that are not represented in the training data for the respective
language pair.


Additionally, we must note that every dataset we used has a significant bias. The DLMF and
MFS specifically focus on OPSF. The im2latex-100k dataset was created from arXiv articles
in the area of high energy physics⁸. A general limitation of neural networks is that trained
models inherit biases from their training data. For a successful formula translation, this
means that the set of symbols, as well as the style in which the formulae are written, has to
be present in the training data. Rather than learning the actual semantics of an expression, a
model captures the notational flavor or convention that another tool produces, such as
Mathematica's export function or LaTeXML. The LaTeX generated by both Mathematica and
LaTeXML is limited to a specific vocabulary and does not allow variation, as it is produced by
rule-based translators.


⁸ Phenomenology (hep-ph) and Theory (hep-th) specifically.

Because of the limited vocabularies as well as the limited set of LaTeX conventions in the
datasets, the translation of mathematical LaTeX expressions of different flavors is not possible.


Due to the performance on the MFS and im2latex-100k datasets, we conclude that our model
captures more patterns than Mathematica's internal import methods. On the other hand, we
have also shown that our model is unable to capture the semantic information of mathematical
expressions but instead infers semantics from patterns and token structures. Whether these
semantics are correct or consistent with additional contextual information does not matter to
the model. Hence, our translation is rather unpredictable and susceptible to minor visual
changes in the inputs. If we consider the simple examples from Table 1.2 from the
introduction, we can see that our model, like Mathematica, is unable to correctly translate
most expressions. Table 4.4 shows the translations of our model. Three of the translations
even contain obvious syntax errors, such as unbalanced brackets. In comparison to Table 1.2,
we added three more examples to show that marginal changes may have a significant impact on
the final translation. For example, simply changing the variable of integration from x to t in
the first example changes the outcome from a syntactically and semantically invalid expression
to a correct and valid translation. Similarly, additional curly brackets around the limits of
an integral may cause a wrong translation and an error that can be difficult to trace back if
not immediately noticed⁹.


Considering the simplicity of the expressions, a machine translation model alone might not
be the correct approach for a reliable LaTeX to CAS translator, especially because such simple
mistakes harm the trustworthiness of the entire engine. Since accuracy and precision are among
the most important aspects of mathematics, our machine translator cannot be considered
competitive with existing rule-based approaches. A hybrid solution combining ML-enhanced
pattern recognition techniques with rule-based translations could be a more promising solution
in the future.


4.2 Context-Sensitive Translation 


Since the previous section has shown that machine translations are not as reliable as
rule-based approaches, we continue by developing a more reliable strategy that follows
heuristics developed over time by studying mathematical notations. Specifically, we want to
focus on a broader source of mathematical expressions, away from the strict notation
guidelines of the DLMF and the less descriptive scientific articles on arXiv. In the
following, we focus on Wikipedia articles as our primary source of mathematical expressions.


4.2.1 Motivation 


Like many other knowledge base systems, Wikipedia encodes mathematical formulae in a
representational format similar to LaTeX [156, 17, 405]. While this representational format is
simple to comprehend for readers possessing the required mathematical training, additional
explicit knowledge of the semantics associated with each expression in a given formula could
make mathematical content in Wikipedia even more explainable, unambiguous, and, most
importantly, machine-readable. Additionally, making math machine-readable can allow visually
impaired individuals to receive a semantic description of the mathematical content. Finally,
and crucially, moderating and curating mathematical content in a free and community-driven


⁹ Here, the variable of integration switched from x to a in the translated expression due to
the redundant curly brackets around the limits of the integral. This error can be easily
overlooked.


[Figure: screenshot of the Wikipedia section "Via the hypergeometric function" of the article
on Jacobi polynomials. An annotation overlay lists the detected components: the definition of
the Jacobi polynomial, Pochhammer's symbol, the hypergeometric function, and the factorial,
together with a "Computer Verified" label.]


Figure 4.1: Mathematical semantic annotation in Wikipedia.


encyclopedia like Wikipedia is more time-consuming and error-prone without explicit access
to the semantics of a formula. Wikipedia currently uses the Objective Revision Evaluation
Service (ORES) to predict the damaging or good faith intention of an edit using multiple
independent classifiers trained on different datasets [144]. The primary motivation behind
ORES was to reduce the overwhelming workload of content moderation with machine learning
classification solutions. So far, the ORES system applies no special treatment to mathematical
content. Estimating the trustworthiness of an edit in a mathematical expression is
significantly more challenging for human curators and almost infeasible for Artificial
Intelligence (AI) classification models due to the complex nature of mathematics.


In this section, we propose a semantification and translation pipeline that makes the math in
Wikipedia computable via CAS. CAS, such as Maple [36] and Mathematica [393], are complex
mathematical software tools that allow users to manipulate, simplify, plot, and evaluate
mathematical expressions. Hence, translating mathematics in Wikipedia to CAS syntaxes enables
automatic verification checks on complex mathematical equations [2, 11]. Integrating such
verifications into the existing ORES system can significantly reduce the workload of
moderating mathematical content and, at the same time, increase confidence in the quality of
Wikipedia articles [359]. Since such a translation is context-sensitive, we also propose a
semantification approach for the mathematical content. This semantification uses semantic
LaTeX macros [260] from the DLMF [98] and noun phrases from the textual context to
semantically annotate math formulae. The semantic encoding in the DLMF provides additional
information about the components of a formula, the domain, constraints, and links to
definitions, and improves the searchability and discoverability of the mathematical content
[260, 403]. Our semantification approach enables these DLMF features for mathematics in
Wikipedia. Figure 4.1 illustrates our vision of semantic annotations and verification results
in Wikipedia [17]. Head et al. [150] recently showed that providing readers with on-site
information about the individual elements of mathematical expressions [329, 394], as shown in
Figure 4.1, can significantly help users of all experience levels to read and comprehend
articles more efficiently.


Mathematics is not a formal language. Its interpretation heavily depends on the context, e.g.,
π(x + y)¹⁰ can be interpreted as the multiplication πx + πy or as the number of primes less
than or equal to x + y. CAS syntaxes, on the other hand, are unambiguous content languages.
Therefore, the main challenge in enabling CAS verifications for mathematical formulae in
Wikipedia is a

¹⁰ In the following, we use this color coding for examples to easily distinguish them from
other mathematical content in this section.

reliable translation between an ambiguous, context-dependent format and an unambiguous,
context-free CAS syntax. Hence, we derive the following research question:


Research Question


What information is required to translate mathematical formulae from natural language 
contexts to CAS and how can this information be extracted? 


In this section, we present the first context-dependent translation from mathematical LaTeX
expressions to CAS, specifically Maple and Mathematica. We show that a combination of
nearby context analysis (extraction of descriptive terms) and a list of standard notations for
common functions provides sufficient semantic information to outperform existing
context-independent translation techniques, such as CAS-internal LaTeX import functions. We
achieve reliable translations in a four-step augmentation pipeline. These steps are: (1)
pre-processing Wikipedia articles to enable natural language processing on them, (2)
constructing an annotated mathematical dependency graph, (3) generating semantically enhancing
replacement patterns, and (4) performing CAS-specific translations (see Figure 4.2). In
addition, we perform automatic symbolic and numeric computations on the translated expressions
to verify equations from Wikipedia articles [2, 11]. We show that the system is capable of
detecting potential errors in mathematical equations in Wikipedia articles. Future releases
could be integrated into the ORES system to reduce vandalism and improve trust in mathematical
articles in Wikipedia. We demonstrate the feasibility of the translation approach on English
Wikipedia articles and provide access to an interactive demo of our LaTeX to CAS translator
(LaCASt)¹¹.


For the evaluation of the translations, we focus on the sub-domain of OPSF. OPSF are generally
well supported by general-purpose CAS [13], which allows us to estimate the full potential
of our proposed translation and verification pipeline. Since CAS syntaxes are programming
languages, one has the option to add new functionality to a CAS, such as defining a new
function. Defining new functions in a CAS, however, can vary significantly in complexity.
While translating a generic function like f(x) := x^2 is straightforward, defining the prime
counting function from above could be very complex. If a function is explicitly declared in
the CAS, we call a translation to that function direct. General mathematics often does not
have such direct translations. For example, translating the generic function f(x) is
meaningless without considering the actual definition of f(x). Hence, we first focus on
translations of OPSF, which often have direct translations to CAS. In addition, OPSF are
highly interconnected, i.e., many OPSF can be expressed (or even defined) in terms of other
OPSF. One of the main tasks for our future work is to support more non-direct translations,
enabling LaCASt to handle more general mathematics.


In this section, we present our pipeline and discuss each of the augmentation steps.
Section 4.2.2 discusses related work. In Section 4.2.3, we introduce a formal definition for
translating LaTeX to CAS syntaxes. Section 4.2.4 explains the necessary pre-processing steps
for Wikipedia articles. Section 4.2.5 introduces our annotated dependency graph. Section 4.2.6
concludes by replacing generic LaTeX subexpressions with semantically enriched macros from the
DLMF. The evaluation and discussion subsequently continue in Chapter 5.


¹¹ https://tpami.wmflabs.org [accessed 2021-09-01]


4.2.2 Related Work


Our proposed pipeline touches on several well-known tasks from MathIR, namely descriptive
entity recognition for mathematical expressions [183, 213, 279, 320, 329], math tokenization
[402], math dependency recognition [14, 214], and automatic verification [2, 11]. Existing
approaches to translate mathematical formulae from presentational languages, e.g., LaTeX or
MathML, to content languages, e.g., content MathML or CAS syntax, do not analyze the context
of a formula [14, 270, 18]. Hence, existing approaches to translate LaTeX to CAS syntaxes are
limited to simple arithmetic expressions [18] or require manual semantic annotations [14].
Some CAS, such as Mathematica, support LaTeX imports. Those functions fall into the first
category [18] and are limited to rather simple expressions. A semantic annotation, on the
other hand, can be directly encoded in LaTeX via macros and allows for translations of more
complex formulae. Miller et al. [260] developed a set of the previously mentioned semantic
macros that link specific mathematical expressions with definitions in the DLMF [98]. The
manually generated semantic data from the DLMF [403] was successfully translated to and
evaluated by CAS with our proposed framework LaCASt [2, 13]. Therefore, our translation
pipeline contains two steps: first, the semantic enhancement process towards the semantic
LaTeX dialect used by the DLMF; second, the translation from semantic LaTeX to CAS via
LaCASt. In this section, we focus on the first step. The second step is largely covered by
[2, 11, 13]. A more comprehensive overview was given in Section 2.4.


4.2.3 Formal Mathematical Language Translations 


First, we introduce an abstract formalized concept for our translation approach, followed
by a detailed technical explanation of our system. Inspired by the pattern-matching translation
approaches in compilers [263], we introduce a translation on mathematical expressions as
a sequence of tree transformations. In the following, we mainly distinguish between two
kinds of mathematical languages: presentational languages $\mathcal{L}_P$, such as LaTeX¹² or
presentation MathML¹³, and content languages $\mathcal{L}_C$, such as content MathML,
OpenMath [204], or CAS syntaxes [36, 393]. Elements of these languages are often referred to
as symbol layout trees for $e \in \mathcal{L}_P$ or operator trees for $e \in \mathcal{L}_C$
[92]. We call a context-dependent translation $t: \mathcal{L}_P \times X \to \mathcal{L}_C$
with $e \mapsto t(e, X)$ appropriate if the intended semantic meaning of $e \in \mathcal{L}_P$
is the same as that of $t(e, X) \in \mathcal{L}_C$. We further define the context $X$ of an
expression $e$ as a set of facts from the document $D$ the expression $e$ appears in and a set
of common knowledge facts $K$ so that facts from the document may overwrite facts from the
common knowledge set


$$ X := \{ f \mid f \in D \cup K \wedge (f \in K \Rightarrow f \notin D) \}. \tag{4.2} $$


A fact $f$ is a tuple (MOI, MC) of a Mathematical Object of Interest (MOI) [14] and a
Mathematical Concept (MC). An MOI $m$ refers to a meaningful mathematical object in a
document, and the MC uniquely defines the semantics of that MOI. In particular, from the MC of
an MOI $m$, we derive a semantically enhanced version $\tilde{m}$ of $m$ so that
$\tilde{m} \in \mathcal{L}_C$. Hence, from $f$, we derive a graph transformation rule
$r_f = m \to \tilde{m}$ and define $g_f(e)$ as the application $e \xrightarrow{r_f} \tilde{e}$
with $e \in \mathcal{L}_P$, $\tilde{e} \in \mathcal{L}_C$.
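
One possible reading of the context definition, in which a document fact replaces every common knowledge fact about the same MOI, can be sketched as follows. This is an interpretation of the prose above rather than a literal implementation of Equation (4.2); the function name and the example facts are hypothetical.

```python
def build_context(document_facts, common_facts):
    """Merge facts (MOI, MC); facts from the document take precedence."""
    document_mois = {moi for moi, _ in document_facts}
    # Keep a common knowledge fact only if the document says nothing about its MOI.
    kept_common = {(moi, mc) for moi, mc in common_facts if moi not in document_mois}
    return set(document_facts) | kept_common


# Hypothetical facts: common knowledge K and facts extracted from a document D.
common = {
    (r"\pi", "ratio of a circle's circumference to its diameter"),
    ("i", "imaginary unit"),
}
document = {(r"\pi", "prime counting function")}

context = build_context(document, common)
```

In this example, the document about prime numbers overrides the default interpretation of \pi, while the common knowledge fact about i remains available.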


We split the translation $t(e, X)$ into two steps, a semantification step $t_s(e, X)$ and a
mapping step $t_m(e)$. The semantification $t_s(e, X)$ transforms all subexpressions
$\bar{e} \subseteq e$ that are not operator


¹² https://www.latex-project.org/ [accessed 2021-06-29]
¹³ https://www.w3.org/TR/MathML3/ [accessed 2021-06-29]

trees, i.e., $\bar{e} \in \mathcal{L}_P \setminus \mathcal{L}_C$, to operator tree
representations $\tilde{e} \in \mathcal{L}_C$. In the following, we presume that these
subexpressions $\bar{e}$ are MOI so that we can derive $\tilde{e}$ from a fact $f \in X$. Then
we define the semantification step as the sequence of fact-based graph transformations


$$ t_s(e, X) = g_{f_1} \circ \cdots \circ g_{f_n}(e), \tag{4.3} $$


with $f_k \in X$, $k = 1, \ldots, n$. Again, we call a graph transformation $g(e)$ appropriate
if the intended semantics of the expression $e$ and its transformation $g(e)$ are the same.
Further, we call $t_s(e, X)$ complete if all subexpressions $e' \subseteq t_s(e, X)$ are in
$\mathcal{L}_C$ and incomplete otherwise. Note that graph transformations are not commutative,
i.e., there could be $f_1, f_2 \in X$ so that
$g_{f_1} \circ g_{f_2}(e) \neq g_{f_2} \circ g_{f_1}(e)$.
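
The lack of commutativity can be illustrated with a small sketch in which plain string replacements stand in for graph transformations. This is a deliberate simplification (the actual transformations operate on expression trees), and the rules and macro names below are hypothetical examples, not rules taken from the system.

```python
from functools import reduce


def apply_rule(expression, rule):
    """Apply one rewrite rule (pattern, replacement) to the whole expression."""
    pattern, replacement = rule
    return expression.replace(pattern, replacement)


def semantify(expression, rules):
    """Apply the rules one after another, starting with rules[0]."""
    return reduce(apply_rule, rules, expression)


# Hypothetical facts for the expression below; both macro names are illustrative.
prime_counting = (r"\pi(x)", r"\primecount(x)")
circle_constant = (r"\pi", r"\cpi")

expression = r"\pi(x) + \pi"
specific_first = semantify(expression, [prime_counting, circle_constant])
generic_first = semantify(expression, [circle_constant, prime_counting])
```

Applying the more specific rule first keeps both interpretations apart, whereas applying the generic rule first rewrites every \pi and leaves nothing for the specific rule to match. The two orders therefore yield different results, which is why a ranking of facts is needed.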


The mapping step $t_m(e)$ is a sequence of applications of graph transformation rules that
replace a node (or subtree) with the codomain-specific syntax version of the node (or
subtree). Hence, the mapping step is a context-independent translation
$t_m: \mathcal{L}_{C_1} \to \mathcal{L}_{C_2}$ with
$\mathcal{L}_{C_1}, \mathcal{L}_{C_2} \subseteq \mathcal{L}_C$ and a fixed rule set
$R_{C_1}^{C_2}$ so that $r_k = \mathcal{L}_{C_1} \to \mathcal{L}_{C_2}$ for
$r_k \in R_{C_1}^{C_2}$, $k = 1, \ldots, n$. Then we define


$$ t_m(e) = g_{r_1} \circ \cdots \circ g_{r_n}(e). \tag{4.4} $$


Note that $t_m(e)$ ignores subexpressions $\bar{e} \subseteq e$ that are not in
$\mathcal{L}_C$. For CAS languages $\mathcal{L}_M \subseteq \mathcal{L}_C$, certain subtrees
$\bar{e} \subseteq e$ of an expression $e \in \mathcal{L}_P$ are already operator trees in the
target language, i.e., $\bar{e} \in \mathcal{L}_M$. Hence, we call $t_m(e)$ complete if all
$e' \subseteq e$ with $e' \in \mathcal{L}_{C_1} \setminus \mathcal{L}_{C_2}$ were transformed
to $\mathcal{L}_{C_2}$. Note that a complete $t_m(e)$ is not necessarily appropriate because
such an $\bar{e} \in \mathcal{L}_P \cap \mathcal{L}_C$ could have a different semantic meaning
in $\mathcal{L}_P$ and $\mathcal{L}_C$ (see the π example from the introduction).


Definition of a Context-Sensitive Translation Function


For a given target CAS language $\mathcal{L}_M \subseteq \mathcal{L}_C$, a set of rules
$R_C^M$, and a context $X$, we define the two-step translation process as


$$ t: \mathcal{L}_P \times X \to \mathcal{L}_M, \quad t(e, X) := t_m(t_s(e, X)). \tag{4.5} $$


We call $t(e, X)$ complete if $t_s(e, X)$ and $t_m(e)$ are complete and appropriate.


Splitting the translation $t(e, X)$ into these two steps has the advantage of modularity.
Given an appropriate and complete semantification, we can translate an expression $e$ to any
content language $\mathcal{L}_M \subseteq \mathcal{L}_C$ by using a different set of rules
$R_C^M$ for $t_m(e)$. In previous research, we developed LaCASt [3, 13] as an implementation
of $t_m(e)$ between the content language semantic LaTeX [403] (the semantically enhanced
LaTeX used in the DLMF) and the CAS syntaxes of Maple and Mathematica. Technically, semantic
LaTeX is simply normal LaTeX in which specific subexpressions are replaced by semantically
enhanced macros. In this section, we extend LaCASt to identify the subexpressions that can be
replaced with these semantic LaTeX macros. This semantification is our first translation step
$t_s(e, X)$. The results of $t_s(e, X)$ are in semantic LaTeX, which is in $\mathcal{L}_C$.
For the second step (the mapping), we rely on the original LaCASt implementation (from
semantic LaTeX to CAS syntaxes) for $t_m(e)$ and presume that $t_m(e)$ is complete and
appropriate [2, 11].
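
The modularity argument can be sketched as a composition of two rule-based steps in which only the mapping rules depend on the target CAS. As before, string replacements stand in for tree transformations, and the rule sets and the macro spelling are assumptions made for this illustration; this is not the LaCASt implementation.

```python
def apply_rules(expression, rules):
    """Apply (pattern, replacement) rules in the given order."""
    for pattern, replacement in rules:
        expression = expression.replace(pattern, replacement)
    return expression


def translate(expression, semantification_rules, mapping_rules):
    """Two-step translation t(e, X) = t_m(t_s(e, X))."""
    semantic = apply_rules(expression, semantification_rules)  # t_s(e, X)
    return apply_rules(semantic, mapping_rules)                # t_m(...)


# Illustrative rule sets; the macro spelling is an assumption for this sketch.
CONTEXT_RULES = [(r"\Gamma(z)", r"\EulerGamma@{z}")]
TO_MAPLE = [(r"\EulerGamma@{z}", "GAMMA(z)")]
TO_MATHEMATICA = [(r"\EulerGamma@{z}", "Gamma[z]")]

maple = translate(r"\Gamma(z) + 1", CONTEXT_RULES, TO_MAPLE)
mathematica = translate(r"\Gamma(z) + 1", CONTEXT_RULES, TO_MATHEMATICA)
```

The semantification rules are shared between both targets. Supporting another CAS only requires an additional set of mapping rules.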


To perform a complete and appropriate semantification, we need to solve three remaining
issues. First, how can we derive sufficiently many facts $f \in D$ from a document so that the
transformation rules $r_f$ are appropriate and the semantification $t_s(e, X)$ is appropriate and


Figure 4.2: The workflow of our context-sensitive translation pipeline from LaTeX to CAS
syntaxes.


complete? Second, since the transformation rules are not commutative, a different order of
facts may result in an inappropriate semantification $t_s(e, X)$. Hence, we need to develop a
fact ranking $rk(f)$ so that the sequence of transformations is performed in an appropriate
order. Third, how can we determine whether a translation was appropriate and complete? There
is no general solution available to determine the intended semantic information of an
expression $e \in \mathcal{L}_P$. In turn, it is probably impossible to determine with
certainty whether a translation is appropriate for general expressions. Therefore, we propose
different evaluation approaches that allow us to automatically verify the appropriateness and
completeness of a translation. We applied the same evaluation approaches to the manually
annotated semantic LaTeX sources of the DLMF and successfully identified errors in the DLMF
and the two CAS Maple and Mathematica [2, 11]. Hence, we presume the same technique is
appropriate to detect errors in Wikipedia, too. In addition to these verification evaluations,
we perform a manual evaluation on a smaller test set for a qualitative analysis.


The number of facts (transformation rules) that we derive from a document D is critical. A
low number of transformation rules may result in an incomplete translation. On the other
hand, too many transformation rules may increase the number of false positives and result
in an inappropriate transformation. To solve this issue, we propose a dependency graph of
mathematical expressions containing the MOI of a document as nodes. A dependency in this
graph describes the subexpression relationship between two MOI. We further annotate each
MOI with textual descriptions from the surrounding context. We interpret these descriptions
as references to the mathematical concepts MC that define the MOI and rank each description
according to distance and heuristic measures. Since MOI are often compositions of other MOI,
the dependencies allow us to derive relevant facts for an expression e from the subexpressions
e' ⊆ e. To derive a semantically enhanced version m̃ for an MOI m, we use the semantic
macros from the DLMF. Each semantic macro is a semantically enhanced version m̃ of a
standard representational m. To derive relevant semantic macros, i.e., transformation rules, we
search for the semantic macro’s description that matches the MC of the facts. In turn, we have
a large number of ranked facts with the same MOI m and a ranked list of transformation rules
r_1, ..., r_n for each fact f. The rankings allow us to control the number and order of the graph
transformations g_{r_f}(e) in t_s(e, X). In turn, the annotated dependency graph should solve the
mentioned issues one and two. The pipeline is visualized in Figure 4.2. The rest of this section
explains the pipeline in more detail. The third issue, i.e., determining the appropriateness and
completeness of a translation, is discussed in Section 5.2 in Chapter 5.


4.2.3.1 Example of a Formal Translation 


Consider the example from the introduction, π(x + y), in a document D that describes π(x) as
the prime counting function. Hence, we derive the fact

f = (π(x), prime counting function) ∈ D. (4.6)

In our dependency graph, π(x + y) depends on π(x). Hence, we derive the same fact f for
π(x + y). Based on this fact, we find a function in the DLMF described as ‘the number of primes
not exceeding x’ which uses the semantic macro \nprimes@{x} and the presentation π(x).
Hence, we derive the transformation rule

r_f = \pi(v_1) → \nprimes@{v_1}, (4.7)

where v_1 is a wildcard for variables. For simplicity, this example derives only a single
transformation rule r_f rather than an entire set of ranked rules and facts as described above. Our
final pipeline will derive an entire list of ranked facts and replacement rules that are successively
applied. LaCASt defines a translation rule r_1 ∈ R_Mathematica for this function to PrimePi[x] and
a rule r_2 ∈ R_Maple to pi(x) in Maple¹⁴, respectively. Hence, the translation to Mathematica
would be performed via r_1 as

t(\pi(x+y), X) = t_m(t_s(\pi(x+y), X)) (4.8)
               = g_{r_1}(g_{r_f}(\pi(x+y))) (4.9)
               = g_{r_1}(\nprimes@{x+y}) (4.10)
               = PrimePi[x+y]. (4.11)


For Maple, the translation process is performed via r_2 instead:

t(\pi(x+y), X) = t_m(t_s(\pi(x+y), X)) (4.12)
               = g_{r_2}(g_{r_f}(\pi(x+y))) (4.13)
               = g_{r_2}(\nprimes@{x+y}) (4.14)
               = pi(x+y). (4.15)

This underlines the modular system of our translation pipeline. Further, LaCASt takes care of
additional requirements for successful translations. In this particular example, LaCASt informs
a user about the requirement of loading the NumberTheory package in Maple in order to use
the translated expression pi(x+y). Note that the subexpression x + y was transformed neither
by the semantification step nor by the mapping step, because x + y ∈ L_M ∩ L_P. Hence, this
translation is complete and appropriate.
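The two-step translation above can be illustrated with a short Python sketch. It is not the LaCASt implementation: rules are approximated here by regular expressions over strings rather than graph transformations on parse trees, the wildcard v_1 is assumed to match any argument without nested parentheses, and the names SEMANTIFICATION_RULES, MAPPING_RULES, semantify, map_to_cas, and translate are hypothetical.

```python
import re

# Semantification rule r_f: \pi(v_1) -> \nprimes@{v_1}.
# Assumption: the wildcard v_1 matches any argument without nested parentheses.
SEMANTIFICATION_RULES = [
    (re.compile(r"\\pi\s*\(([^()]*)\)"), r"\\nprimes@{\1}"),
]

# Mapping rules r_1 (Mathematica) and r_2 (Maple) for the semantic macro.
MAPPING_RULES = {
    "Mathematica": [(re.compile(r"\\nprimes@\{([^{}]*)\}"), r"PrimePi[\1]")],
    "Maple": [(re.compile(r"\\nprimes@\{([^{}]*)\}"), r"pi(\1)")],
}


def apply_rules(expr, rules):
    """Apply the rules one after another, in the given order."""
    for pattern, replacement in rules:
        expr = pattern.sub(replacement, expr)
    return expr


def semantify(expr):
    """First step t_s: presentational LaTeX -> semantic LaTeX."""
    return apply_rules(expr, SEMANTIFICATION_RULES)


def map_to_cas(expr, cas):
    """Second step t_m: semantic LaTeX -> syntax of the chosen CAS."""
    return apply_rules(expr, MAPPING_RULES[cas])


def translate(expr, cas):
    """Full translation t(e, X) = t_m(t_s(e, X))."""
    return map_to_cas(semantify(expr), cas)


print(translate(r"\pi(x+y)", "Mathematica"))
print(translate(r"\pi(x+y)", "Maple"))
```

Swapping the target CAS only changes the rule set used in the second step, which mirrors the modularity noted above. The subexpression x+y passes through both steps untouched because no rule matches it.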


¹⁴ Maple requires pre-loading the NumberTheory package.

4.2.4 Document Pre-Processing


For extracting the facts from a document D, we need to identify all MOI and MC. In previous 
research [329], we have shown that noun phrases can represent definiens of identifiers. Hence, 
we presume noun phrases are good candidates for MCs too. To properly extract noun phrases, 
we use CoreNLP [240] as our POS tagger [367, 368]. Since CoreNLP is unable to parse math- 
ematics, we replace all math by placeholders first. In a previous project [279], we proposed a 
Mathematical Language Processor (MLP) that replaces mathematical expressions with place-
holders. Occasionally, this approach yields wrong annotations. For example, CoreNLP may tag
factorial or polynomial as adjectives when a math token follows, even in cases where they are
clearly naming mathematical objects¹⁵. However, the MLP approach works reasonably well in
most cases.


Since Wikipedia articles are written in Wikitext, we use Sweble [99] to parse an article, replace
MOI with placeholders, remove visual templates, and generate a plain text version of an article.
Wikipedia officially recommends encoding in-line mathematics via templates that do not use
LaTeX encoding (see Appendix B available in the electronic supplementary material for more
details about math formulae in Wikipedia). In addition, since Wikipedia is community-driven,
many mathematical expressions are not properly annotated as such. This makes it challenging
to detect all MOI in a given document. For example, the Jacobi polynomial article¹⁶ contains
several formulae that use neither the math template nor the <math> tag (for LaTeX), such as
the single identifier ''x'' and the UTF-8 character sequences ε > 0, [ε, {{pi}}-ε], and
0 ≤ φ ≤ 4{{pi}}. As an approach to detect such erroneous math, we consider sequences
of symbols with specific Unicode properties as math. This includes the properties Sm for
math symbols, Sk for symbol modifier, Ps, Pe, Pd, and Po for several forms of punctuation
and brackets, and Greek for Greek letters. In addition, single letters in italic, e.g., ''x'', are
interpreted as math as well, which was already successfully used by MLP. Via MLP we also
replace UTF-8 characters by their TeX equivalent. In the end, the erroneous UTF-8 encoded
sequence 0 ≤ φ ≤ 4{{pi}} is replaced by 0 \leq \phi \leq 4\pi and considered as a
single MOI. Using this approach, we detect 27 math-tags, 11 math-templates (including one
numblk), and 13 in-line mathematics with erroneous annotations in the Jacobi polynomials
article. The in-line math contains six single italic letters and seven complex sequences. In one
case, the erroneous math was given in parentheses and the closing parenthesis was falsely
identified as part of the math expression. Every other detection was correct. In the future,
more in-depth studies can be applied to improve the accuracy of in-line math detection in
Wikitext [123, 377].
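The Unicode-based detection and the replacement of UTF-8 characters can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MLP code: Python's standard library exposes general categories (Sm, Sk, Ps, Pe, Pd, Po) but no script property, so Greek letters are approximated by inspecting the character name, and the mapping UNICODE_TO_TEX is a deliberately tiny, hypothetical table.

```python
import unicodedata

# General categories named in the text above.
MATH_CATEGORIES = {"Sm", "Sk", "Ps", "Pe", "Pd", "Po"}

# Hypothetical, deliberately tiny mapping from UTF-8 characters to TeX.
UNICODE_TO_TEX = {"≤": r"\leq", "φ": r"\phi", "π": r"\pi", "ε": r"\epsilon"}


def is_greek(ch):
    # Assumption: approximate the Unicode script property "Greek"
    # by checking the official character name.
    return "GREEK" in unicodedata.name(ch, "")


def has_math_property(ch):
    """True if the character has one of the listed Unicode properties."""
    return unicodedata.category(ch) in MATH_CATEGORIES or is_greek(ch)


def to_tex(sequence):
    """Replace the {{pi}} template and known UTF-8 characters by TeX."""
    # Limitation of this sketch: no space is inserted between a TeX
    # command and a directly following letter.
    sequence = sequence.replace("{{pi}}", r"\pi")
    return "".join(UNICODE_TO_TEX.get(ch, ch) for ch in sequence)


print(to_tex("0 ≤ φ ≤ 4{{pi}}"))
```

Note that Po also covers ordinary punctuation such as commas and full stops, so a practical detector needs an additional criterion for where a math sequence starts and ends; the sketch only classifies single characters.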


4.2.5 Annotated Dependency Graph Construction 


Retrieving the single noun phrase (i.e., MC) that correctly describes a single MOI is most likely
infeasible. Instead, we will retrieve multiple noun phrases for each MOI and try to rank them
accordingly. In the following, we construct a mathematical dependency graph for Wikipedia
articles in order to retrieve as many relevant noun phrases for an MOI as possible. As we have
discussed in an earlier project [214], there are multiple valid options to construct a dependency
graph. We decided to use the POM tagger [402] to generate parse trees from LaTeX expressions
to build a dependency graph. The POM tagger lets us establish dependencies by comparing
annotated, semantic parse trees. Since the POM tagger aims to disambiguate mathematical
expressions in the future, the accuracy of our new dependency graph directly scales with an
increasing amount of semantic information available to the POM tagger. In addition, the more
the POM tagger is able to disambiguate expressions, the more subexpressions e' ⊆ e ∈ L_P are
already in our target language e' ∈ L_C. Our translator LaCASt also relies on the parse tree of the
POM tagger [3, 13]. Technically, this allows us to feed LaCASt directly with additional semantic
information via manipulating the parse tree from the POM tagger. For example, consider the
expression a(b + c). In general, LaCASt would interpret the expression as a multiplication between
a and (b + c), as most conversion tools would [18]. However, we can easily tag the first token a
as a function in the parse tree and thereby change the translation accordingly without further
programmatic changes. In the following, we only work on the parse tree of the POM tagger,
which can be considered as part of L_P.

¹⁵ For example, "The Jacobi polynomial MATH_1 is an orthogonal polynomial." Both ‘polynomial’ tokens in this
sentence are tagged as JJ (Adjective) with CoreNLP version 4.2.2.
¹⁶ https://en.wikipedia.org/wiki/Jacobi_polynomials [accessed 2021-06-07]


To establish dependencies between MOI, we introduce the concept of a mathematical stem
(similar to ‘word stems’ in natural languages) that describes the static part of a function that
does not change, e.g., the red tokens in Γ(x) or P_n^{(α,β)}(x). Mathematical functions often have a
unique identifier as part of the stem that represents the function, such as Γ(x) or P_n^{(α,β)}(x). The
identification of a stem of an MOI, however, is already context-dependent. As our introduction
example of π(x + y) shows, the location of the stem depends on the identification of π(x + y)
as the prime counting function. At this point in our pipeline, we lack sufficient semantic
information about the MOI to identify the stem. On the other hand, a basic logic is necessary to
avoid erroneous MOI dependencies. We apply the following heuristic for an MOI dependency:
(i) at least one identifier must match in the same position in both MOI and (ii) this identifier is
not enclosed in parentheses. Now, we replace every identifier in an MOI m_1 by a wildcard that
matches a sequence of tokens or entire subtrees. If this pattern matches another MOI m_2 and the
match obeys our heuristics (i) and (ii), we say m_2 depends on m_1 and define a directed edge from
m_1 to m_2 in the graph. With the second heuristic, we avoid a dependency between Γ(x) and
π(x) (since x fulfills the first heuristic but not the second). In the future, it would be worthwhile
to study more heuristics on MOI to identify the stem via machine learning algorithms. A more
comprehensive heuristic analysis is desirable, since not every function has a unique identifier
in the stem, e.g., the Pochhammer symbol (x)_n. Examples of dependencies between MOI can
be found in Appendix F.2 available in the electronic supplementary material and on our
demo page.
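The dependency heuristic can be approximated on flat token lists as in the following Python sketch. It is only an illustration and differs from the approach described above in several assumed simplifications: MOI are token lists rather than POM parse trees, an identifier is assumed to be a single Latin letter or a TeX command, "same position" is approximated by the wildcard slot of the pattern, and only the first regular-expression match is inspected. The function name depends_on is hypothetical.

```python
import re

# Assumption: an identifier is a single Latin letter or a TeX command.
IDENTIFIER = re.compile(r"\\[A-Za-z]+|[A-Za-z]")


def depends_on(m2, m1):
    """Return True if MOI m2 depends on MOI m1 (both given as token lists)."""
    parts, slots, depth = [], [], 0
    for tok in m1:
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
        if IDENTIFIER.fullmatch(tok):
            # Wildcard: matches one or more whitespace-separated tokens.
            parts.append(r"(\S+(?: \S+)*?)")
            slots.append((tok, depth))
        else:
            parts.append(re.escape(tok))
    match = re.fullmatch(" ".join(parts), " ".join(m2))
    if match is None:
        return False
    # Heuristics (i) and (ii): some identifier of m1 outside parentheses
    # must reappear unchanged in its own slot in m2.
    return any(
        slot_depth == 0 and match.group(i + 1) == tok
        for i, (tok, slot_depth) in enumerate(slots)
    )


pi_x = [r"\pi", "(", "x", ")"]
pi_xy = [r"\pi", "(", "x", "+", "y", ")"]
gamma_x = [r"\Gamma", "(", "x", ")"]

print(depends_on(pi_xy, pi_x))    # pi(x + y) depends on pi(x)
print(depends_on(pi_x, gamma_x))  # only x matches, but inside parentheses
```

A caller building the graph would add an edge from m_1 to m_2 whenever depends_on(m2, m1) holds and would skip the trivial case m_1 = m_2, which this sketch does not filter out.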


In addition to the new concept for addressing math stems, we also changed our approach for 
definition detection. Previously [214], we presumed that every equation symbol declares a 
definition for the left-hand side expression. This would have a significant impact on the transla- 
tion to CAS. Further, definitions must be translated differently compared to normal equations. 
Currently, there is no reliable approach available to distinguish an equation from a definition. 
Existing approaches try to classify entire textual sections in a document as definitions [111, 
134, 183, 370] but not a single formula. We will elaborate more on this matter in Section 5.2.3. 
For now, we only consider an equation symbol as a definition if it is explicitly declared as such 
via :=. 


For annotating MOIs with textual descriptions, we first used a support vector machine [213] and
later applied distance metrics [279, 329, 330] between single identifiers and textual descriptions.
We were able to reach an F1 score of 0.36 for annotating single identifiers with textual
descriptions. Since we are working on more complex, less overloaded [14], MOI expressions now, we
can presume an improvement if we apply the same approach again. Hence, we used our latest
improvements [330] and applied some changes to annotate MOI rather than single identifiers
with textual descriptions from the surrounding context. Originally, we considered only nouns,
noun sequences, adjectives followed by nouns, and Wikipedia links as candidates of definiens
(now MC) [329]. However, in the field of OPSF, such descriptions are generally insufficient.
Hence, we include connective possessive endings and prepositions between noun phrases (see
Appendix F.1 available in the electronic supplementary material for further details).


Originally [329], we scored an identifier-definiens pair based on (1) the distance between the 
current identifier and its first occurrence in the document, (2) the distance (shortest path in the 
parse tree) between the definiens and the identifier, and (3) the distribution of the definiens in 
the sentence. We adopt this scoring technique for MOI and MC with slight adjustments. For 
condition (2), we declare the first noun in an MC as the representative token in the natural 
language parse tree. Therefore, (2) uses the shortest path between an MOI and the representative 
token in the parse tree. For condition (1), we need to identify the locations of MOIs throughout 
an entire document. Our dependency graph allows us to track the location of an MOI in 
the document. Hence, (1) calculates the distance of an MOI and its first occurrence isolated 
or as a dependent of another MOI in the document. In addition, we set the score to 1 if a
combination of MOI and noun phrase matches one of the patterns NP MOI or MOI (is|are) DT? NP.
These basic patterns have proven to be very effective in previous experiments for extracting
descriptions of mathematical expressions [213, 214, 279, 330]. We denote the final score of a
fact f, i.e., of an MOI and MC pair, with s_MLP(MOI, MC).
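The two patterns NP MOI and MOI (is|are) DT? NP can be checked on a chunked sentence as in the Python sketch below. The input format is an assumption made for illustration: a sentence is a list of (text, tag) pairs in which noun phrases are already merged into single chunks tagged NP and math placeholders are tagged MOI. The function name pattern_matches is hypothetical and does not refer to the MLP code.

```python
def pattern_matches(chunks):
    """Return (moi, noun_phrase) pairs matching 'NP MOI' or 'MOI (is|are) DT? NP'."""
    pairs = []
    for i, (text, tag) in enumerate(chunks):
        if tag != "MOI":
            continue
        # Pattern 1: a noun phrase directly precedes the MOI.
        if i > 0 and chunks[i - 1][1] == "NP":
            pairs.append((text, chunks[i - 1][0]))
        # Pattern 2: MOI, then 'is' or 'are', an optional determiner, a noun phrase.
        j = i + 1
        if j < len(chunks) and chunks[j][0].lower() in ("is", "are"):
            j += 1
            if j < len(chunks) and chunks[j][1] == "DT":
                j += 1
            if j < len(chunks) and chunks[j][1] == "NP":
                pairs.append((text, chunks[j][0]))
    return pairs


sentence = [
    ("The", "DT"),
    ("Jacobi polynomial", "NP"),
    ("MATH_1", "MOI"),
    ("is", "VBZ"),
    ("an", "DT"),
    ("orthogonal polynomial", "NP"),
]
print(pattern_matches(sentence))
```

Every pair returned this way would receive the score 1, while all remaining MOI and MC pairs keep the distance-based score.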


4.2.6 Semantic Macro Replacement Patterns 


Now, we derive a rule r_f for a fact f so that the MOI m ∈ L_P can be replaced by a semantically
enhanced version m̃ ∈ L_C of it. The main issue is that we are still unable to identify the stems
of a formula. Consider we have the MOI P_n^{(α,β)}(z) identified as Jacobi polynomial. How do we
know the stem of a Jacobi polynomial and that n, α, β, and z are parameters and variables? For
an appropriate translation, we even need to identify the right order of these arguments. There
are two approaches: (i) we identify the definition of the formula in the article or (ii) we look up
a standard notation. The first approach works because with the definition, we can deduce the
stem of a function by identifying which identifiers of the function are reused in the definition.
For example, in Figure 4.1, we see that n, α, β, and z appear in the definition of the Jacobi
polynomial but not P. Hence, we can conclude that the stem of the Jacobi polynomial must
be P_n^{(α,β)}(x). There are two remaining issues with this approach. First, what if a definition
does not exist in the same article? This happens relatively often for OPSF, since OPSF are
well established with more or less standard notation styles. Second, as previously pointed out,
we cannot distinguish definitions from normal equations yet. As long as there is no reliable
approach to identify definitions, approach (i) is infeasible. As a workaround, we focus on
approach (ii) and leave (i) for future work.


In order to get standard notations and derive patterns of them, we use the semantic macros
in the DLMF [260, 403]. A semantic macro is a semantically enhanced LaTeX expression that
unambiguously describes the content of the expression. Hence, we can interpret a semantic
macro as an unambiguous operator subtree m̃ ∈ L_C. The rendered version of the macro (i.e.,
the normal LaTeX version) is in a presentational format m ∈ L_P. Hence, we can derive a fact-
based rule r_f = m → m̃ by finding the appropriate semantic macro for a given mathematical
description (the MC in a fact f). The DLMF defines more than 600 different semantic macros
for OPSF. A single semantic macro may produce multiple rendered forms, e.g., by omitting the
parentheses around the argument in sin x. This allows fine-grained control over the visualization
of the formulae. Table 4.5 contains the four different versions for the general hypergeometric
function (controlled by the number of @s). The last version (without variables and no @ symbol)
is a special case, which never appears in the DLMF. However, every semantic macro is also
syntactically valid without arguments. Note also that not every version visualizes all information
that is encoded in a semantic macro. For example, \genhyperF{2}{1}@@@{a,b}{c}{z} omits the
variables a, b, and c. Table 4.5 also shows the LaTeX for each version of the macro. By replacing
the arguments with wildcards, we generate a LaTeX pattern m that defines a rule m → m̃. If
the LaTeX omits information, we fill the missing slots of m̃ with the default arguments denoted
in the definitions of the semantic macros. For example, the default arguments for the general
hypergeometric function are p and q for the parameters and a_1, ..., a_p, b_1, ..., b_q, and z for
the variables. Hence, the last version in Table 4.5 fills up the slots for the variables with these
default arguments (given in gray). In addition, the default arguments from the DLMF definitions
also tell us if the argument can be a list, i.e., it may contain commas. Hence, we allow the two
wildcards for the first two variables var1 and var2 to match sequences with commas while
the other wildcards are more restrictive and reject sequences with commas.

Table 4.5: Mappings and likelihoods for the semantic LaTeX macro of the general hypergeometric
function in the DLMF.

Prob.   Semantic Macro                                                   LaTeX                                        Rendered
19.7%   \genhyperF{par1}{par2}@{var1}{var2}{var3}                        {}_{par1}F_{par2}(var1; var2; var3)          {}_2F_1(a, b; c; z)
80.3%   \genhyperF{par1}{par2}@@{var1}{var2}{var3}                       {}_{par1}F_{par2}({var1 \atop var2}; var3)   {}_2F_1({a, b \atop c}; z)
0.0%    \genhyperF{par1}{par2}@@@{a_1,\dots,a_p}{b_1,\dots,b_q}{var3}    {}_{par1}F_{par2}(var3)                      {}_2F_1(z)
0.0%    \genhyperF{par1}{par2}@{a_1,\dots,a_p}{b_1,\dots,b_q}{z}         {}_{par1}F_{par2}                            {}_2F_1
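How a rendered template turns into a replacement pattern can be sketched in Python. The sketch below is an assumption-laden illustration rather than the actual pattern generator: it works on strings with regular expressions instead of parse trees, expects the exact spacing of the template, and uses the hypothetical helper names build_rule and apply_rule. The two templates are taken from the first row of Table 4.5.

```python
import re

# Placeholders used in the templates of Table 4.5 (par1, par2, var1, ...).
PLACEHOLDER = re.compile(r"(?:par|var)\d+")


def build_rule(latex_template, macro_template, list_args=()):
    """Turn a rendered LaTeX template into a (pattern, replace) pair."""
    regex, pos = "", 0
    for m in PLACEHOLDER.finditer(latex_template):
        regex += re.escape(latex_template[pos:m.start()])
        name = m.group(0)
        # List arguments may contain commas; all other wildcards reject them.
        body = r"[^;{}()]+" if name in list_args else r"[^,;{}()]+"
        regex += f"(?P<{name}>{body})"
        pos = m.end()
    regex += re.escape(latex_template[pos:])
    pattern = re.compile(regex)

    def replace(match):
        # Fill every placeholder of the semantic macro with its matched text.
        return PLACEHOLDER.sub(
            lambda p: match.group(p.group(0)).strip(), macro_template
        )

    return pattern, replace


def apply_rule(expr, rule):
    pattern, replace = rule
    return pattern.sub(replace, expr)


rule = build_rule(
    r"{}_{par1}F_{par2}(var1; var2; var3)",
    r"\genhyperF{par1}{par2}@{var1}{var2}{var3}",
    list_args=("var1", "var2"),
)
print(apply_rule(r"{}_{2}F_{1}(a, b; c; z)", rule))
```

Using a replacement function instead of a replacement string avoids a clash between the macro name \genhyperF and the group-reference syntax \g<...> of Python's re module.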


Since every semantic macro in the DLMF has a description, we can retrieve semantic macros,
and thereby the replacement rules r_f, by using the annotations in the dependency graph as search
queries. Currently, every fact has an MLP score s_MLP(f). But for each fact, we may retrieve
multiple replacement patterns depending on how well the noun phrase (the MC) matches a
semantic macro description in the DLMF. To solve this issue, we develop a cumulative ranking
rk(f) for each fact. The first part of the ranking is the MLP score s_MLP(f) that ranks the pair of
MOI and description MC. Second, we index all DLMF replacement patterns in an Elasticsearch
(ES)¹⁷ database to search for a semantic macro for a given description. ES uses the BM25 score
to retrieve relevant semantic macros for a given query. Hence, the second component of the
ranking function is the ES score (normalized over all retrieved hits) for a retrieved semantic
macro m̃ and the given description MC: s_ES(f). Lastly, every semantic macro m̃ has multiple
rendered forms, of which some are more frequently used than others in the DLMF, see the
probability in Table 4.5. Hence, we score a rule r_f = m → m̃ based on its likelihood of
use in the DLMF. We counted the different versions of each semantic macro in the DLMF to
calculate the likelihood of use. The last two replacement patterns in the table (the ones omitting
information) never appear in the DLMF and have a probability of 0%. We denote this score
as s_DLMF(r_f). The ranking rk(f) for a fact is simply the average over the three components
s_MLP(f), s_ES(f), and s_DLMF(r_f).

¹⁷ https://github.com/elastic/elasticsearch [accessed 2021-01-01]
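The averaging of the three components can be written down in a few lines of Python. The sketch assumes that the ES score is normalized by dividing by the best hit, which the text does not specify, and the class RankedRule, the helper names, and all score values except the 80.3% and 0.0% likelihoods from Table 4.5 are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankedRule:
    macro: str
    s_mlp: float   # score of the MOI and MC pair
    s_es: float    # normalized search score of the semantic macro
    s_dlmf: float  # likelihood of use of the rendered form in the DLMF

    @property
    def rank(self):
        """rk(f): plain average over the three components."""
        return (self.s_mlp + self.s_es + self.s_dlmf) / 3


def normalize_es_scores(raw_scores):
    # Assumption: normalize by the best retrieved hit.
    top = max(raw_scores, default=0.0)
    return [s / top if top > 0 else 0.0 for s in raw_scores]


def order_rules(rules):
    """Highest-ranked rules are applied first."""
    return sorted(rules, key=lambda r: r.rank, reverse=True)


rules = [
    RankedRule(r"\genhyperF{par1}{par2}@@@{var1}{var2}{var3}", 0.9, 1.0, 0.0),
    RankedRule(r"\genhyperF{par1}{par2}@@{var1}{var2}{var3}", 0.9, 1.0, 0.803),
]
for rule in order_rules(rules):
    print(round(rule.rank, 3), rule.macro)
```

Because the rules are not commutative, the sorted order matters: in this toy input the frequently used @@ form is tried before the form that never appears in the DLMF.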


4.2.6.1 Common Knowledge Pattern Recognition 


Since LaCASt was specifically developed for the semantics of the DLMF, it is not aware of general
mathematical notation conventions. We fixed this issue by defining rules as part of the common
knowledge set of facts K. We rank facts from K higher than facts from the article A to
perform common knowledge pre-processing transformations prior to the facts derived from
the article. Note that we do not presume that the following rules are always true. However, in
the context of OPSF, we achieved better results by activating them by default and, if applicable,
deactivating them for certain scenarios. The rules state that π is always interpreted as the
constant, e is Euler’s number if e is followed by a superscript (power) at least once in the
expression, i is the imaginary unit if it does not appear in a subscript (index), and γ is the Euler-
Mascheroni constant if the terms Mascheroni or Euler exist in any f ∈ A. Note that these
heuristics are consistent in an equation, i.e., i is never both an index and the imaginary unit
within one equation. Further, we add rules for derivative notations, such as dy/dx, where y is
optional and d can be followed by a superscript with a numeric value. In addition, LaCASt
presumes \diff{.} (e.g., for dx) after integrals indicating the end of the argument of an
integral. Hence, we search for d or d¹⁸ followed by a letter after integrals to replace it with
\diff{.} (see [11] for a more detailed discussion on this approach). Finally, a letter preceding
parentheses is tagged as a function in the parse tree if the expression in parentheses contains
commas or semicolons or does not contain arithmetic symbols, such as + or −. Note that once
a symbol is identified as a function following this rule, it is tagged as such everywhere, regardless
of the local situation. For example, in f(x + π) = f(x) we would tag f as a function even though
the first part f(x + π) violates the mentioned rule. As previously mentioned, this changes the
translation from f*(x+Pi) in Mathematica to f[x+Pi]. We provide a detailed step-by-step
example of the translation pipeline and an interactive demo at: https://tpami.wmflabs.org.
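The function-tagging heuristic from the last rule can be illustrated with the following Python sketch. It operates on plain strings with a regular expression, whereas the rule described above is applied to the parse tree; nested parentheses are not handled, and the names find_functions and to_mathematica are hypothetical. The rewriting step is a toy stand-in for the translation and covers only the bracket notation and the constant π.

```python
import re

# A single letter (not part of a longer name or TeX command) followed by
# a parenthesized argument without nested parentheses.
CALL = re.compile(r"(?<![A-Za-z\\])([A-Za-z])\s*\(([^()]*)\)")
ARITHMETIC = set("+-−*/")


def find_functions(expr):
    """Collect letters that are tagged as functions anywhere in expr."""
    functions = set()
    for name, argument in CALL.findall(expr):
        has_separator = "," in argument or ";" in argument
        has_arithmetic = any(ch in ARITHMETIC for ch in argument)
        if has_separator or not has_arithmetic:
            functions.add(name)
    return functions


def to_mathematica(expr):
    """Toy rewrite: function calls use brackets, everything else is a product."""
    functions = find_functions(expr)

    def rewrite(match):
        name, argument = match.group(1), match.group(2)
        if name in functions:
            return f"{name}[{argument}]"
        return f"{name}*({argument})"

    return CALL.sub(rewrite, expr).replace(r"\pi", "Pi")


print(to_mathematica(r"f(x+\pi) = f(x)"))
print(to_mathematica("a(b+c)"))
```

The first call shows the global effect of the rule: f(x) alone qualifies f as a function, so f(x+\pi) is rewritten with brackets as well, whereas a(b+c) remains a multiplication.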


This Chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License 


(http://creativecommons.org/licenses/by/4.0/). 


¹⁸ Note the difference between normal d and the roman typestyle d.

This chapter primarily contributes to research task V, i.e., evaluating the effectiveness of
the semantification and translation system LaCASt. In Section 5.1, we also extend LaCASt's semantic
LaTeX translations to support more mathematical operators, including sums, products, integrals,
and limit notations. Hence, this chapter also contributes to research task IV, i.e.,
implementing an extension of the semantification approach to provide translations to CAS. We
evaluate LaCASt on two different datasets: the DLMF and Wikipedia.


First, we evaluate LaCASt on the DLMF to estimate the capabilities and limitations of our rule-
based translator on a semantically enhanced dataset. Translating formulae from the DLMF to CAS
can be considered simpler primarily for three reasons. First, the formulae are manually enhanced
and can be considered unambiguous in most cases. Second, the constraints of formulae are
directly attached to equations and therefore accessible to LaCASt. Lastly, parts of equations in the
DLMF are linked to their definitions, which allows resolving substitutions and fetching additional
constraints. In Wikipedia articles, this meta information is either not available or only given in
the surrounding context, which greatly harms the accessibility of this crucial data. Hence, we
presume that we achieve the best possible translations via LaCASt on the DLMF. For evaluating
the capabilities of LaCASt, we perform numeric and symbolic evaluation techniques to evaluate
a translation [3, 13]. We will further use these evaluation approaches to identify flaws in the
DLMF and CAS computations.


Next, we evaluate LaCASt on Wikipedia as the direct successor of the previous Chapter 4. Here,
we use the full and final version of LaCASt, including every improvement that has been discussed
throughout the thesis. Specifically, it actively uses all common knowledge pattern recognition
techniques discussed in Section 4.2.6.1, all heuristics for detecting math operators introduced in
Section 5.1.2, and the enhanced symbolic and numeric evaluation pipeline first outlined in [3]
and finally elaborated in Section 5.1.3. In combination with the automatic evaluation, we are
able to perform plausibility checks on complex mathematical formulae in Wikipedia.


This chapter is split into two parts, each following its own main motivation. In Section 5.1,
we elaborate on the possibility of using LaCASt translations to automatically verify entire DML and
CAS with one another. We specifically focus on the DLMF for our DML and Mathematica
and Maple for our general-purpose CAS. In Section 5.2, we use the final context-sensitive
version of LaCASt introduced in Chapter 4, including every improvement introduced in
Section 5.1 of this chapter, with the goal to verify equations in Wikipedia articles. This chapter
finalizes the improvements of LaCASt for semantic LaTeX expressions (Section 5.1) and general
LaTeX expressions (Section 5.2).


The content of Section 5.1 was published at the TACAS conference [8]. Some parts in Section 5.2 
have also been previously published at the CICM conference [2]. Section 5.2, as the direct 
successor of Chapter 4, is part of the aforementioned submission to the TPAMI journal [11]. 


5.1 Evaluations on the Digital Library of Mathematical Functions


Digital Mathematical Libraries (DML) gather the knowledge and results from thousands of years
of mathematical research. Even though pure and applied mathematics are precise disciplines,
gathering their knowledge bases over many years results in issues which every digital library
shares: consistency, completeness, and accuracy. Likewise, CAS¹ play a crucial role in the
modern era for pure and applied mathematics, and those fields which rely on them. CAS can be
used to simplify, manipulate, compute, and visualize mathematical expressions. Accordingly,
modern research regularly uses DML and CAS together. Nonetheless, DML [2, 10] and CAS [20,
100, 180] are not exempt from having bugs or errors. Duran et al. [100] even raised the rather
dramatic question: “can we trust in [CAS]?”

¹ In the sequel, the acronyms CAS and DML are used, depending on the context, interchangeably with their
plurals.


Existing comprehensive DML, such as the DLMF [98], are consistently updated and frequently
corrected with errata². Although each chapter of the DLMF and its print analog The NIST
Handbook of Mathematical Functions [276] has been carefully written, edited, validated, and 
proofread over many years, errors still remain. Maintaining a DML, such as the DLMF, is a 
laborious process. Likewise, CAS are eminently complex systems, and in the case of commercial 
products, often similar to black boxes in which the magic (i.e., the computations) happens in 
opaque private code [100]. CAS, especially commercial products, are often exclusively tested 
internally during development. 


An independent examination process can improve testing and increase trust in the systems and 
libraries. Hence, we want to elaborate on the following research question. 


Research Question


How can digital mathematical libraries and computer algebra systems be utilized to 
improve and verify one another? 


Our initial approach for answering this question is inspired by Cohl et al. [2]. In order to verify
a translation tool from a specific LaTeX dialect to Maple, they perform symbolic and numeric
evaluations on equations from the DLMF. This approach presumes that a proven equation in
a DML must also be valid in a CAS. In turn, a disparity between the DML and CAS would
point to an issue in the translation process. However, assuming a correct translation, a disparity
would also indicate an issue either in the DML source or the CAS implementation. In turn,
we can take advantage of the same approach proposed by Cohl et al. [2] to improve and even
verify DML with CAS and vice versa. Unfortunately, previous efforts to translate mathematical
expressions from various formats, such as LaTeX [3, 10], MathML [18], or OpenMath [152], to
CAS syntax show that the translation will be the most critical part of this verification approach.
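The idea behind a numeric evaluation can be sketched in Python. The example is a simplified stand-in for the evaluation pipeline, which is explained in Section 5.1.3: both sides of a translated equation are evaluated at a few test values and compared within a tolerance. The function name numerically_valid, the test values, and the tolerance are assumptions made for this illustration, and the gamma recurrence merely serves as a well-known test equation.

```python
import math


def numerically_valid(lhs, rhs, test_values, tolerance=1e-9):
    """Return the test values for which lhs and rhs disagree."""
    failures = []
    for value in test_values:
        left, right = lhs(value), rhs(value)
        if not math.isclose(left, right, rel_tol=tolerance, abs_tol=tolerance):
            failures.append(value)
    return failures


test_values = [0.5, 1.5, 2.0, 3.7]

# Recurrence of the gamma function: Gamma(z + 1) = z * Gamma(z).
correct = numerically_valid(
    lambda z: math.gamma(z + 1), lambda z: z * math.gamma(z), test_values
)

# The same equation with a deliberately introduced sign error.
flawed = numerically_valid(
    lambda z: math.gamma(z + 1), lambda z: -z * math.gamma(z), test_values
)

print(correct)  # no failing test values
print(flawed)   # every test value fails
```

A non-empty list of failures does not reveal where the problem lies: it may stem from the translation, the library source, or the CAS computation, which is why the reliability of the translation is critical for this approach.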


In this section, we elaborate on the feasibility and limitations of the translation approach from 
DML to CAS as a possible answer to our research question. We further focus on the DLMF as 
our DML and the two general-purpose CAS Maple and Mathematica for this first study. This 
relatively sharp limitation is necessary in order to analyze the capabilities of the underlying 
approach to verify commercial CAS and large DML. The DLMF uses semantic macros internally 
in order to disambiguate mathematical expressions [260, 403]. These macros help to mitigate the 
open issue of retrieving sufficient semantic information from a context to perform translations 
to formal languages [10, 18]. Further, the DLMF and general-purpose CAS have a relatively 
large overlap in coverage of special functions and orthogonal polynomials. Since many of those 
functions play a crucial role in a large variety of different research fields, we focus in this first 
study mainly on these functions. 


In particular, we extend the first version of LaCASt [3] to increase the number of translatable
functions in the DLMF significantly. Current extensions include a new handling of constraints, the
support for the mathematical operators: sum, product, limit, and integral, as well as overcoming
semantic hurdles associated with Lagrange (prime) notations commonly used for differentiation.
Further, we extend its support to include Mathematica using the freely available WED³
(hereafter, with Mathematica, we refer to the WED). These improvements allow us to cover
a larger portion of the DLMF, increase the reliability of the translations via LaCASt, and allow
for comparisons between two major general-purpose CAS for the first time, namely Maple and
Mathematica. Finally, we provide open access to all the results contained within this paper⁴.

² https://dlmf.nist.gov/errata/ [accessed 2021-05-01]


The section is structured as follows. Section 5.1.1 explains the data in the DLMF. Section 5.1.2
focuses on the improvements of LaCASt that were made to make the translation as comprehensive
and reliable as possible for the upcoming evaluation. Section 5.1.3 explains the symbolic
and numeric evaluation pipeline and discusses that process in depth.
Subsequently, we analyze the results in Section 5.1.4. Finally, we conclude the
findings and provide an outlook for upcoming projects in Section 5.1.5.


Related Work Existing verification techniques for CAS often focus on specific subroutines
or functions [45, 58, 107, 148, 180, 185, 225, 228], such as specific theorems [218], differential
equations [153], or the implementation of the math.h library [224]. Most common are verification
approaches that rely on intermediate verification languages [45, 148, 153, 180, 185], such as
Boogie [29, 225] or Why3 [41, 185], which, in turn, rely on proof assistants and theorem provers,
such as Coq [37, 45], Isabelle [153, 167], or HOL Light [146, 148, 180]. Kaliszyk and Wiedijk [180]
proposed an entirely new CAS which is built on top of the proof assistant HOL Light so that
each simplification step can be proven by the underlying architecture. Lewis and Wester [228]
manually compared the symbolic computations on polynomials and matrices with seven CAS.
Aguirregabiria et al. [20] suggested teaching students the known traps and difficulties with
evaluations in CAS instead, in order to reduce the overreliance on computational solutions.


We [2] developed the aforementioned translation tool LaCASt, which translates expressions from
a semantically enhanced LaTeX dialect to Maple. By evaluating the performance and accuracy of
the translations, we were able to discover a sign error in one of the DLMF’s equations [2]. While
the evaluation was not intended to verify the DLMF, the translations by the rule-based translator
LaCASt provided sufficient robustness to identify issues in the underlying library. To the best of
our knowledge, besides this related evaluation via LaCASt, there are no existing libraries or tools
that allow for automatic verification of DML.


5.1.1 The DLMF dataset 


In the modern era, most mathematical texts (handbooks, journal publications, magazines,
monographs, treatises, proceedings, etc.) are written using the document preparation system
LaTeX. However, the focus of LaTeX is on precise control of the rendering mechanics rather than
on a semantic description of its content. In contrast, CAS syntax is necessarily unambiguous
so that the input can be interpreted correctly. Hence, a transformation tool from DML to CAS
must disambiguate mathematical expressions. While there is an ongoing effort towards such a
process [14, 214, 329, 402, 408], there is no reliable tool available to disambiguate mathematics
sufficiently to date.


https://www.wolfram.com/engine/ [accessed 2021-05-01]
https://lacast.wmflabs.org/ [accessed 2021-10-01]


The DLMF contains numerous relations between functions and many other properties. It is
written in LaTeX but uses specific semantic macros when applicable [403]. Each of these semantic
macros represents a unique function or polynomial defined in the DLMF. Hence, the semantic
LaTeX used in the DLMF is often unambiguous. For a successful evaluation via CAS, we also
need to utilize all requirements of an equation, such as constraints, domains, or substitutions.
The DLMF provides this additional data too, generally in a machine-readable form [403].
This data is accessible via the i-boxes (information boxes next to an equation, marked with an
info icon). If the information is not given in the attached i-box or the information is incorrect,
the translation via LaCASt would fail. The i-boxes, however, do not contain information about
branch cuts (see Section 5.1.4.1) or constraints. Constraints are accessible if they are directly
attached to an equation. If they appear in the text (or even a title), LaCASt cannot utilize them.
The test dataset we use was generated from DLMF Version 1.0.28 (2020-09-15) and contained
9,977 formulae with 1,505 defined symbols, 50,590 used symbols, 2,691 constraints, and 2,443
warnings for non-semantic expressions, i.e., expressions without semantic macros [403]. Note
that the DLMF does not provide access to the underlying LaTeX source. Therefore, we added the
source of every equation to our result dataset.


5.1.2 Semantic LaTeX to CAS translation 


The aforementioned translator LaCASt was first developed by Greiner-Petter et al. [3, 10]. They
reported a coverage of 53.6% translations [3] for a manually selected part of the DLMF to the
CAS Maple. Afterward, they extended LaCASt to perform symbolic and numeric evaluations on
the entire DLMF and reported a coverage of 58.8% translations [2]. This version of LaCASt serves
as a baseline for our improvements. They reported a success rate of ~16% for symbolic and
~12% for numeric verifications.


Evaluating the baseline on the entire DLMF results in a coverage of only 31.6%. Hence, we first
want to increase the coverage of LaCASt on the DLMF. To achieve this goal, we first increased
the number of translatable semantic macros by manually defining more translation patterns
for special functions and orthogonal polynomials. For Maple, we increased the number from
201 to 261. For Mathematica, we defined 279 new translation patterns, which enable LaCASt to
perform translations to Mathematica. Even though the DLMF uses 675 distinct semantic
macros, we cover ~70% of all DLMF equations with our extended list of translation patterns (see
Zipf’s law for mathematical notations [14]). In addition, we implemented rules for translations
that are applicable in the context of the DLMF, e.g., ignore ellipses following floating-point
values, or \choose always refers to a binomial expression. Finally, we tackle the remaining
issues outlined by Cohl et al. [2], which can be categorized into three groups: (i) expressions in
which the arguments of operators are not clear, namely sums, products, integrals, and limits;
(ii) expressions with prime symbols indicating differentiation; and (iii) expressions that contain
ellipses. While we solve some of the cases in Group (iii) by ignoring ellipses following
floating-point values, most of these cases remain unresolved.


In the following, we first introduce the constraint handling via blueprints. Next, we elaborate
on our solutions for (i) in Section 5.1.2.2 and (ii) in Section 5.1.2.3.


This subsection 5.1.2.1 was previously published by Cohl et al. [2].


5.1.2.1 Constraint Handling


Correct assumptions about variable domains are essential for CAS and, not surprisingly,
lead to significant improvements in a CAS's ability to simplify. The DLMF provides constraint
(variable domain) metadata for formulae, and we have extracted this formula metadata. We have
incorporated these constraints as assumptions for the simplification process (see Section 5.1.3.1).
Note, however, that a direct translation of the constraint metadata is usually not sufficient for a
CAS to be able to understand it. Furthermore, numerical tests on invalid values return
incorrect results (see Section 5.1.3.2).


For instance, symbols must be interpreted differently depending on their usage, and one
must be able to interpret such notations correctly. One must be able
to interpret the command a,b\in A, which indicates that both variables a and b are elements
of the set A (or more generally a_1,\dots,a_n\in A). Similar conventions are often used for
variables being elements of other sets such as the sets of rational, real or complex numbers, or
for subsets of those sets.


Also, one must be able to interpret the constraints as variables in sets defined using an equals
notation such as n=0,1,2,\dots, which indicates that the variable n is an integer greater than
or equal to zero, or together n,m=0,1,2,\dots, where both variables n and m are elements of this
set. Since mathematicians who write LaTeX are often casual about expressions such as these, one
should know that 0,1,2,\dots is the same as 0,1,\dots. Consistently, one must also be able
to correctly interpret infinite sets (represented as strings) such as =1,2,\dots, =1,2,3,\dots,
=-1,0,1,2,\dots, =0,2,4,\dots, or even =3,7,11,\dots, or =5,9,13,\dots. One must
be able to interpret finite sets such as =1,2, =1,2,3, or =1,2,\dots,N.


An entire language of mathematical notation must be translated in order for
CAS to be able to understand constraints. In mathematics, the syntax of constraints is often
very compact and contains textual explanations. Translating constraints from LaTeX to CAS is
a complex task because CAS only allow precise and strict syntax formats. For example, the
typical constraint 0 < x < 1 is invalid if directly translated to Maple, because it would need to
be translated to two separate constraints, namely x > 0 and x < 1.


We have improved the handling and translation of variable constraints/assumptions for
simplification and numerical evaluation. Adding assumptions about the constrained variables
improves the effectiveness of Maple’s simplify function. Our previous approach for constraint
handling for numerical tests was to extract a pre-defined set of test values and to filter invalid
values according to the constraints. Because of this strategy, there were often no valid values
remaining after the filtering. To overcome this issue, we instead chose a single numerical value
for a variable that appears in a pre-defined constraint. For example, if a test case contains the
constraint 0 < x < 1, we chose x = 1/2.


A naive approach for this strategy is to apply regular expressions to identify a match between
a constraint and a rule. However, we believed that this approach would not scale well with a
growing number of pre-defined rules and more complex constraints. Hence, we used the
POM-tagger to create blueprints of the parse trees for pre-defined rules. For the example LaTeX
constraint $0 < x < 1$, rendered as 0 < x < 1, our textual rule is given by


0 < var < 1 ==> 1/2.


The parse tree for this blueprint constraint contains five tokens, where var is an alphanumerical
token that is considered to be a placeholder for a variable.


We can also distinguish multiple variables by adding an index to the placeholder. For example,
the rule we generated for the mathematical LaTeX constraint $x,y \in \Real$, where \Real
is the semantic macro which represents the set of real numbers, and rendered as x, y ∈ ℝ, is
given by


var1, var2 \in \Real ==> 3/2,3/2.


A constraint will match one of the blueprints if the number, the ordering, and the type of the 
tokens are equal. Allowed matching tokens for the variable placeholders are Latin or Greek 
letters and alphanumerical tokens. 
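

The matching criterion can be illustrated with a minimal Python sketch. It is not the LaCASt implementation: the regular-expression tokenizer stands in for the POM-tagger, and the names BLUEPRINTS, GREEK, tokenize, and assign_test_values are hypothetical. Only the two rules quoted above are included.

```python
import re

# Hypothetical blueprint table: pattern -> one test value per placeholder.
BLUEPRINTS = {
    r"0 < var < 1": ["1/2"],
    r"var1, var2 \in \Real": ["3/2", "3/2"],
}

# A few Greek-letter macros only; a complete table would list all of them.
GREEK = {r"\alpha", r"\beta", r"\gamma", r"\mu", r"\nu", r"\theta"}

# Crude tokenizer (assumption): macros, alphanumerical tokens, numbers, symbols.
TOKEN = re.compile(r"\\[A-Za-z]+|[A-Za-z][A-Za-z0-9]*|\d+|\S")


def tokenize(expr):
    return TOKEN.findall(expr)


def is_placeholder(token):
    return re.fullmatch(r"var\d*", token) is not None


def is_variable(token):
    # Latin letters, alphanumerical tokens, or Greek-letter macros.
    return token in GREEK or re.fullmatch(r"[A-Za-z][A-Za-z0-9]*", token) is not None


def assign_test_values(constraint):
    """Return {variable: test value} for the first matching blueprint, else None."""
    tokens = tokenize(constraint)
    for blueprint, values in BLUEPRINTS.items():
        pattern = tokenize(blueprint)
        if len(pattern) != len(tokens):  # the number of tokens must be equal
            continue
        bound = []
        for expected, actual in zip(pattern, tokens):  # ordering and type must match
            if is_placeholder(expected):
                if not is_variable(actual):
                    break
                bound.append(actual)
            elif expected != actual:
                break
        else:
            return dict(zip(bound, values))
    return None
```

For example, assign_test_values("0 < x < 1") yields {"x": "1/2"}, whereas "0 < 5 < 1" matches no blueprint because 5 is not an allowed token for a variable placeholder.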


5.1.2.2 Parse sums, products, integrals, and limits 


Here we consider common notations for the sum, product, integral, and limit operators. For
these operators, one may consider Mathematically Essential Operator Metadata (MEOM). For
all these operators, the MEOM includes argument(s) and bound variable(s). The operators act
on the arguments, which are themselves functions of the bound variable(s). For sums and
products, the bound variables are referred to as indices. The bound variables for integrals are
called integration variables. For limits, the bound variables are continuous variables (for limits of
continuous functions) and indices (for limits of sequences). For integrals, MEOM include precise
descriptions of regions of integration (e.g., piecewise continuous paths/intervals/regions). For
limits, MEOM include limit points (e.g., points in ℝⁿ or ℂⁿ for n ∈ ℕ), as well as information
related to whether the limit to the limit point is independent or dependent on the direction in
which the limit is taken (e.g., one-sided limits).


For a translation of mathematical expressions involving the LaTeX commands \sum, \int, \prod,
and \lim, we must extract the MEOM. This is achieved by (a) determining the argument of the
operator and (b) parsing corresponding subscripts, superscripts, and arguments. For integrals,
the MEOM may be complicated, but certainly contains the argument (function which will be
integrated), bound (integration) variable(s) and details related to the region of integration. Bound
variable extraction is usually straightforward since it is usually contained within a differential
expression (infinitesimal, pushforward, differential 1-form, exterior derivative, measure, etc.),
e.g., dx. Argument extraction is less straightforward since even though differential expressions
are often given at the end of the argument, sometimes the differential expression appears in the
numerator of a fraction (e.g., ∫ f(x)dx / g(x)), in which case the argument is everything to the
right of the \int (neglecting its subscripts and superscripts) up to and including the fraction
involving the differential expression (which may be replaced with 1). In cases where the differential
expression is fully to the right of the argument, it is a termination symbol. Note that
some scientists use an alternate notation for integrals where the differential expression appears
immediately to the right of the integral, e.g., ∫ dx f(x). However, this notation does not appear
in the DLMF. If such notations are encountered, we follow the same approach that we used for
sums, products, and limits (see Section 5.1.2.2).


The notion of integrals includes: antiderivatives (indefinite integrals), definite integrals, contour integrals,
multiple (surface, volume, etc.) integrals, Riemannian volume integrals, Riemann integrals, Lebesgue integrals,
Cauchy principal value integrals, etc.


Extraction of variables and corresponding MEOM The subscripts and superscripts of
sums, products, limits, and integrals may be different for different notations and are therefore
challenging to parse. For integrals, we extract the bound (integration) variable from the
differential expression. For sums and products, the upper and lower bounds may appear in
the subscript or superscript. Parsing subscripts is comparable with the problem of parsing
constraints [2] (which are often not consistently formulated). We overcame this complexity
by manually defining patterns of common constraints and refer to them as blueprints (see
Section 5.1.2.1). This blueprint pattern approach allows LaCASt to identify the MEOM in the
sub- and superscripts.


For our MEOM blueprints, we define three placeholders: varN for single identifiers or a list
of identifiers (delimited by commas), numL1, and numU1, representing lower and upper bound
expressions, respectively. In addition, for sums and products, we need to distinguish between
including and excluding boundaries, e.g., 1 ≤ k and 1 < k. An excluding relation, such as
0 < k < 10, must be interpreted as a sum from 1 to 9. Table 5.1 shows the final set of sum/product
subscript blueprints.
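

The difference between including and excluding boundaries can be made concrete with a short Python sketch. It handles only subscripts of the shape numL1 REL varN REL numU1 with integer bounds; the regular expression and the function name parse_subscript are hypothetical simplifications and not part of LaCASt.

```python
import re

# Matches e.g. "0 < k < 10", "0 \leq k < 10", or "0 < n,k \leq 10" (integer bounds only).
PATTERN = re.compile(
    r"^\s*(-?\d+)\s*(<|\\leq)\s*([A-Za-z](?:\s*,\s*[A-Za-z])*)\s*(<|\\leq)\s*(-?\d+)\s*$"
)


def parse_subscript(subscript):
    """Return (bound variables, inclusive lower bound, inclusive upper bound) or None."""
    match = PATTERN.match(subscript)
    if match is None:
        return None
    lower, lower_rel, names, upper_rel, upper = match.groups()
    variables = [name.strip() for name in names.split(",")]
    # An excluding relation shifts the integer bound by one.
    lo = int(lower) + (1 if lower_rel == "<" else 0)
    hi = int(upper) - (1 if upper_rel == "<" else 0)
    return variables, lo, hi
```

With this sketch, parse_subscript("0 < k < 10") gives (["k"], 1, 9), i.e., a sum from 1 to 9, while parse_subscript(r"0 \leq k < 10") gives (["k"], 0, 9).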


Standard notations may not explicitly show infinity boundaries. Hence, we set the default
boundaries to infinity. For limit expressions we need different blueprints to capture the limit
direction. We cover the standard notations with ‘var1 \to numL*’, where * is either +, -,
^+, ^- or absent and the different arrow-notations where \to can be either \downarrow,
\uparrow, \searrow, or \nearrow, specifying one-sided limits. Note that the arrow-notation
(besides \to) is not used in the DLMF and thus has no effect on the performance of LaCASt
in our evaluation. Note further that, while the blueprint approach is very flexible, it cannot
handle every possible scenario, such as the divisor sum ∑_{(p-1)|2n} 1/p [98, (24.10.1)]. Proper
translations of such complex cases may even require symbolic manipulation, which is currently
beyond the capabilities of LaCASt.


Table 5.1: Examples of the blueprints for subscripts of sums/products, each with an example
expression that matches the blueprint.


Blueprint                             | Example
numL1 \leq var1 < var2 \leq numU1     | 0 ≤ n < k ≤ 10
-\infty < varN < \infty               | -∞ < n < ∞
numL1 < varN < numU1                  | 0 < n,k < 10
numL1 \leq varN < numU1               | 0 ≤ k < 10
numL1 < varN \leq numU1               | 0 < n,k ≤ 10
varN \leq numU1                       | n,k ≤ N + 5
varN \in numL1                        | n ∈ {1,2,3}
varN = numL1                          | n,k,l = 1


Identification of operator arguments Once we have extracted the bound variable for
sums, products, and limits, we need to determine the end of the argument. We analyzed all
sums in the DLMF and developed a heuristic that covers all the formulae in the DLMF and
potentially a large portion of general mathematics. Let x be the extracted bound variable. For
sums, we consider a summand as a part of the argument if (I) it is the very first summand
after the operation; or (II) x is an element of the current summand; or (III) x is an element 
of the following summand (subsequent to the current summand) and there is no termination 
symbol between the current summand and the summand which contains x with an equal or 
lower depth according to the parse tree (i.e., closer to the root). We consider a summand as a 
single logical construct since addition and subtraction are granted a lower operator precedence 
than multiplication in mathematical expressions. Similarly, parentheses are granted higher 
precedence and, thus, a sequence wrapped in parentheses is part of the argument if it obeys the 
rules (I-III). Summands, and such sequences, are always entirely part of sums, products, and 
limits or entirely not. 


A termination symbol always marks the end of the argument list. Termination symbols are
relation symbols, e.g., =, ≠, ≤, closing parentheses or brackets, e.g., ), ], or >, and other
operators with MEOMs, if and only if they define the same bound variable. If x is part of a
subsequent operation, then the following operator is considered as part of the argument (as
in (I)). However, a special condition for termination symbols is that a symbol is only a
termination symbol for the current chain of arguments. Consider a sum over a fraction of sums. In that
case, we may reach a termination symbol within the fraction. However, the termination symbol
would be deeper inside the parse tree as compared to the current list of arguments. Hence, we
used the depth to determine if a termination symbol should be recognized or not. Consider an
unusual notation with the binomial coefficient as an example


\sum_{k=0}^{n} \binom{n}{k} = \sum_{k=0}^{n} \frac{\prod_{m=1}^{n} m}{\prod_{m=1}^{k} m \prod_{m=1}^{n-k} m}


This equation contains two termination symbols, marked red and green in the original typesetting:
the = sign between the two sides of the equation (red) and the last \prod in the denominator
(green). The red termination symbol = is obviously for the first sum on the left-hand side of
the equation. The green termination symbol \prod terminates the product to its left because both
products run over the same bound variable m. In addition, none of the other = signs are
termination symbols for the sum on the right-hand side of the equation because they are deeper
in the parse tree and thus do not terminate the sum.
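

A deliberately flattened Python sketch of rules (I-III) follows. It assumes the expression after the operator has already been split into top-level summands and termination symbols, and it ignores the parse-tree depth discussed above; the names TERMINATORS, identifiers, and extract_argument are hypothetical.

```python
import re

# Top-level termination symbols considered in this sketch (a subset, by assumption).
TERMINATORS = {"=", "<", ">", r"\leq", r"\geq", r"\neq", ")", "]"}


def identifiers(summand):
    """Single-letter identifiers that are not part of a LaTeX macro name."""
    return set(re.findall(r"(?<![\\A-Za-z])[A-Za-z](?![A-Za-z])", summand))


def extract_argument(items, bound_var):
    """Return the leading summands that form the argument of the operator."""
    end = 0
    for i, item in enumerate(items):
        if item in TERMINATORS:
            break
        # (I) first summand, (II) summand contains the bound variable.
        # (III) is implicit: summands before a later match are included too.
        if i == 0 or bound_var in identifiers(item):
            end = i + 1
    return items[:end]
```

For the summands a_k, b, c_k followed by = and d_k with bound variable k, the sketch returns a_k, b, and c_k: the summand b is included by rule (III), and d_k is excluded because of the termination symbol.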


Note that varN in the blueprints also matches multiple bound variables, e.g., ∑_{m,k∈A}. In such
cases, x from above is a list of bound variables and a summand is part of the argument if one
of the elements of x is within this summand. Due to the translation, the operation will be
split into two consecutive operations, i.e., ∑_{m,k∈A} becomes ∑_{m∈A} ∑_{k∈A}. Figure 5.1 shows the
extracted arguments for some example sums. The same rules apply for extraction of arguments
for products and limits.


Figure 5.1: Example argument identifications for sums.


5.1.2.3 Lagrange’s notation for differentiation and derivatives


Another remaining issue is the Lagrange (prime) notation for differentiation, since it does not
outwardly provide sufficient semantic information. This notation presents two challenges. First,
we do not know with respect to which variable the differentiation should be performed. Consider
for example the Hurwitz zeta function ζ(s, a) [98, §25.11]. In the case of a differentiation ζ'(s, a),
it is not clear if the function should be differentiated with respect to s or a. To remedy this
issue, we analyzed all formulae in the DLMF which use prime notations and determined which
variables (slots) for which functions represent the variables of the differentiation. Based on
our analysis, we extended the translation patterns by meta information for semantic macros
according to the slot of differentiation. For instance, in the case of the Hurwitz zeta function, the
first slot is the slot for prime differentiation, i.e., ζ'(s, a) = d/ds ζ(s, a). The identified variables
of differentiations for the special functions in the DLMF can be considered to be the standard
slots of differentiations, e.g., in other DML, ζ'(s, a) most likely refers to d/ds ζ(s, a).


The second challenge occurs if the slot of differentiation contains complex expressions rather
than single symbols, e.g., ζ'(s², a). In this case, ζ'(s², a) = d/d(s²) ζ(s², a) instead of d/ds ζ(s², a).
Since CAS often do not support derivatives with respect to complex expressions, we use the
inbuilt substitution functions in the CAS to overcome this issue. To do so, we use a temporary
variable temp for the substitution. CAS perform substitutions from the inside to the outside.
Hence, we can use the same temporary variable temp even for nested substitutions. Table 5.2
shows the translation performed for ζ'(s², a). CAS may provide optional arguments to calculate
the derivatives for certain special functions, e.g., Zeta(n,z,a) in Maple for the n-th derivative
of the Hurwitz zeta function. However, this shorthand notation is generally not supported
(e.g., Mathematica does not define such an optional parameter). Our substitution approach is
more lengthy but also more reliable. Unfortunately, lengthy expressions generally harm the
performance of CAS, especially for symbolic manipulations. Hence, we have a genuine interest
in keeping translations short, straightforward and readable. Thus, the substitution translation
pattern is only triggered if the variable of differentiation is not a single identifier. Note that this
substitution only triggers on semantic macros. Generic functions, including prime notations,
are still skipped.


Table 5.2: Example translations for the prime derivative of the Hurwitz zeta function with
respect to s².


System      | ζ'(s², a)
DLMF        | \Hurwitzzeta'@{s^2}{a}
Maple       | subs(temp=(s)^(2), diff(Zeta(0,temp,a), temp$(1)))
Mathematica | D[HurwitzZeta[temp,a], {temp,1}] /. temp->(s)^(2)
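

The decision between the direct and the substitution pattern can be sketched in a few lines of Python. The function translate_zeta_prime is hypothetical and merely assembles target strings; the substitution outputs follow Table 5.2, while the direct forms for a single identifier are an assumption about what the shorter translation looks like.

```python
import re


def translate_zeta_prime(slot_expr, second_arg, order=1):
    """Build CAS strings for the prime derivative of the Hurwitz zeta function.

    slot_expr is the (already translated) expression in the slot of differentiation.
    """
    if re.fullmatch(r"[A-Za-z]\w*", slot_expr):
        # Single identifier: differentiate directly (assumed form, not from Table 5.2).
        maple = f"diff(Zeta(0,{slot_expr},{second_arg}), {slot_expr}$({order}))"
        mathematica = f"D[HurwitzZeta[{slot_expr},{second_arg}], {{{slot_expr},{order}}}]"
    else:
        # Complex expression: differentiate with respect to temp, then substitute.
        maple = f"subs(temp={slot_expr}, diff(Zeta(0,temp,{second_arg}), temp$({order})))"
        mathematica = f"D[HurwitzZeta[temp,{second_arg}], {{temp,{order}}}] /. temp->{slot_expr}"
    return {"Maple": maple, "Mathematica": mathematica}
```

Calling translate_zeta_prime("(s)^(2)", "a") reproduces the two CAS rows of Table 5.2, whereas translate_zeta_prime("s", "a") avoids the temporary variable and keeps the translation short.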


A related problem to MEOM of sums, products, integrals, limits, and differentiations is the
notation of derivatives. The semantic macro for derivatives \deriv{w}{x} (rendered as dw/dx) is
often used with an empty first argument to render the function behind the derivative notation,
e.g., \deriv{}{x}\sin@{x} for d/dx sin x. This leads to the same problem we faced above for
identifying MEOMs. In this case, we use the same heuristic as we did for sums, products, and
limits. Note that derivatives may be written following the function argument, e.g., sin(x) d/dx. If
we are unable to identify any following summand that contains the variable of differentiation
before we reach a termination symbol, we look for arguments prior to the derivative according
to the heuristic (I-III).


Note that Maple also supports an evaluation substitution via the two-argument eval function. Since our
substitution only triggers on semantic macros, we only use subs if the function is defined in Maple. In turn, as far
as we know, there is no practical difference between subs and the two-argument eval in our case.


Wronskians With the support of prime differentiation described above, we are also able
to translate the Wronskian [98, (1.13.4)] to Maple and Mathematica. A translation requires
one to identify the variable of differentiation from the elements of the Wronskian, e.g., z for
W{Ai(z), Bi(z)} from [98, (9.2.7)]. We analyzed all Wronskians in the DLMF and discovered
that most Wronskians have a special function in their arguments, such as the example above.
Hence, we can use our previously inserted metadata information about the slots of differentiation
to extract the variable of differentiation from the semantic macros. If the semantic macro
argument is a complex expression, we search for the identifier in the arguments that appears
in both elements of the Wronskian. For example, in W{Ai(z^a), ζ(z², a)}, we extract z as the
variable since it is the only identifier that appears in the arguments z^a and z² of the elements.
This approach is also used when there is no semantic macro involved, i.e., from W{z^a, z²} we
extract z as well. If LaCASt extracts multiple candidates or none, it throws a translation exception.
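

The search for the common identifier can be sketched in Python as follows. The arguments passed in are assumed to be the expressions in the slots of differentiation of the two Wronskian elements; the identifier regex and the function wronskian_variable are hypothetical stand-ins for the parse-tree based extraction.

```python
import re


def identifiers(expr):
    """Single-letter identifiers that are not part of a LaTeX macro name."""
    return set(re.findall(r"(?<![\\A-Za-z])[A-Za-z](?![A-Za-z])", expr))


def wronskian_variable(first_slot, second_slot):
    """Return the unique identifier shared by both slots of differentiation."""
    candidates = identifiers(first_slot) & identifiers(second_slot)
    if len(candidates) != 1:
        # Mirrors the translation exception for multiple or no candidates.
        raise ValueError(f"ambiguous variable of differentiation: {sorted(candidates)}")
    return candidates.pop()
```

For the slots z^a and z^2 the sketch returns z, since a occurs only in the first slot; for x y and x y it raises an error because two candidates remain.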


5.1.3 Evaluation of the DLMF using CAS 


[Figure 5.2: workflow diagram connecting the Digital Library of Mathematical Functions, Case Analyzer, Case Filter, Constraints, Constraint Blueprints, Substitutions, LaCASt Translator, Numeric Test Value Filter, Symbolic Evaluator, and Numeric Evaluator for Maple and Mathematica.]


Figure 5.2: The workflow of the evaluation engine and the overall results. Errors and abortions
are not included. The generated dataset contains 9,977 equations. In total, the case analyzer
splits the data into 10,930 cases of which 4,307 cases were filtered. This sums up to a set of
6,623 test cases in total.

For evaluating the DLMF with Maple and Mathematica, we symbolically and numerically verify
the equations in the DLMF with CAS. If a verification fails both symbolically and numerically, we
have identified an issue either in the DLMF, the CAS, or the verification pipeline. Note that an issue
does not necessarily represent errors/bugs in the DLMF, CAS, or LaCASt (see the discussion
about branch cuts in Section 5.1.4.1). Figure 5.2 illustrates the pipeline of the evaluation engine.
First, we analyze every equation in the DLMF (hereafter referred to as test cases). A case
analyzer splits multiple relations in a single line into multiple test cases. Note that only the
adjacent relations are considered, i.e., with f(z) = g(z) = h(z), we generate two test cases
f(z) = g(z) and g(z) = h(z) but not f(z) = h(z). In addition, expressions with ± and ∓ are
split accordingly, e.g., i^{±i} = e^{∓π/2} [98, (4.4.12)] is split into i^{+i} = e^{-π/2} and i^{-i} = e^{+π/2}. The
analyzer utilizes the attached additional information in each line, i.e., the URL in the DLMF,
the used and defined symbols, and the constraints. If a used symbol is defined elsewhere in
the DLMF, it performs substitutions. For example, the multi-equation [98, (9.6.2)] is split into
six test cases and every ζ is replaced by (2/3) z^{3/2} as defined in [98, (9.6.1)]. The substitution is
performed on the parse tree of expressions [10]. A definition is only considered as such if
the defining symbol is identical to the equation’s left-hand side. That means z = ((3/2) ζ)^{2/3} [98,
(9.6.10)] is not considered as a definition for ζ. Further, semantic macros are never substituted by
their definitions. Translations for semantic macros are exclusively defined by the authors. For
example, the equation [98, (11.5.2)] contains the Struve K_ν(z) function. Since Mathematica does
not contain this function, we defined an alternative translation to its definition H_ν(z) - Y_ν(z) in
[98, (11.2.5)] with the Struve function H_ν(z) and the Bessel function of the second kind Y_ν(z),
because both of these functions are supported by Mathematica. The second entry in Table E.2
in Appendix E available in the electronic supplementary material shows the translation for this
test case.
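

The splitting step of the case analyzer can be illustrated with a small Python sketch. Unlike the actual analyzer, which works on the parse tree, this sketch operates on flat LaTeX strings and would therefore also split on relation symbols inside subscripts; the function names are hypothetical.

```python
import re

# Relation symbols on which a line is split (flat-string approximation).
RELATION = re.compile(r"\s*(\\leq|\\geq|\\neq|=|<|>)\s*")


def split_relations(expr):
    """Split a chain a = b = c into the adjacent relations a = b and b = c."""
    parts = RELATION.split(expr)  # [lhs, rel, mid, rel, rhs, ...]
    return [f"{parts[i]} {parts[i + 1]} {parts[i + 2]}" for i in range(0, len(parts) - 2, 2)]


def split_plus_minus(expr):
    """Resolve \\pm and \\mp into the upper-sign and the lower-sign case."""
    if r"\pm" not in expr and r"\mp" not in expr:
        return [expr]
    upper = expr.replace(r"\pm", "+").replace(r"\mp", "-")
    lower = expr.replace(r"\pm", "-").replace(r"\mp", "+")
    return [upper, lower]


def generate_test_cases(line):
    return [case for relation in split_relations(line) for case in split_plus_minus(relation)]
```

A line without any relation symbol yields an empty list in this sketch, which corresponds to skipping the line.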


Next, the analyzer checks for additional constraints defined by the used symbols recursively.
The mentioned Struve K_ν(z) test case [98, (11.5.2)] contains the Gamma function. Since the
definition of the Gamma function [98, (5.2.1)] has a constraint ℜz > 0, the numeric evaluation
must respect this constraint too. For this purpose, the case analyzer first tries to link the variables
in constraints to the arguments of the functions. For example, the constraint ℜz > 0 sets a
constraint for the first argument z of the Gamma function. Next, we check all arguments in the
actual test case at the same position. The test case contains Γ(ν + 1/2). In turn, the variable
z in the constraint of the definition of the Gamma function ℜz > 0 is replaced by the actual
argument used in the test case. This adds the constraint ℜ(ν + 1/2) > 0 to the test case. This
process is performed recursively. If a constraint does not contain any variable that is used in
the final test case, the constraint is dropped.
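

A single, non-recursive step of this constraint inheritance might look as follows in Python. The table DEFINITION_CONSTRAINTS and both function names are hypothetical, and the constraint is handled as a plain string rather than as a parse tree.

```python
import re

# Hypothetical metadata: macro name -> (formal parameter, constraint of the definition).
DEFINITION_CONSTRAINTS = {"Gamma": ("z", r"\Re(z) > 0")}


def instantiate_constraint(macro, actual_arg):
    """Replace the formal parameter in the constraint by the actual argument."""
    formal, template = DEFINITION_CONSTRAINTS[macro]
    # Match the parameter only as a standalone identifier, not inside a macro name.
    pattern = rf"(?<![\\A-Za-z]){re.escape(formal)}(?![A-Za-z])"
    return re.sub(pattern, lambda _: actual_arg, template)


def is_relevant(constraint, test_case_variables):
    """Keep a constraint only if it mentions a variable of the final test case."""
    tokens = set(re.findall(r"\\[A-Za-z]+|[A-Za-z]", constraint))
    return bool(tokens & set(test_case_variables))
```

For the argument of Γ(ν + 1/2), instantiate_constraint("Gamma", r"\nu + 1/2") produces \Re(\nu + 1/2) > 0, which is kept for a test case in the variables ν and z.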


In total, the case analyzer would identify four additional constraints for the test case [98,
(11.5.2)]. Note that the constraints may contain variables that do not appear in the actual test
case, such as ℜν + k + 1 > 0. Such constraints do not have any effect on the evaluation because
if a constraint cannot be computed to true or false, the constraint is ignored. Unfortunately,
this recursive loading of additional constraints may generate impossible conditions in certain
cases, such as |Γ(iy)| [98, (5.4.3)]. There are no valid real values of y such that ℜ(iy) > 0. In
turn, every test value would be filtered out, and the numeric evaluation would not verify the
equation. However, such cases are the minority and we were able to increase the number of
correct evaluations with this feature.


See Table E.2 in Appendix E available in the electronic supplementary material for the applied constraints
(including the directly attached constraint ℜz > 0 and the manually defined global constraints from Figure 5.3).


To avoid a large portion of incorrect calculations, the analyzer filters the dataset before
translating the test cases. We apply two filter rules to the case analyzer. First, we filter expressions
that do not contain any semantic macros. Due to the limitations of LaCASt, these expressions
most likely result in wrong translations. Further, it filters out several meaningless expressions
that are not verifiable, such as z = x in [98, (4.2.4)]. The result dataset flags these cases with
“Skipped - no semantic math”. Note that the result dataset still contains the translations for these
cases to provide a complete picture of the DLMF. Second, we filter expressions that contain
ellipses (e.g., \cdots), approximations, and asymptotics (e.g., O(z²)) since those expressions
cannot be evaluated with the proposed approach. Further, a definition is skipped if it is not a
definition of a semantic macro, such as [98, (2.3.13)], because definitions without an appropriate
counterpart in the CAS are meaningless to evaluate. Definitions of semantic macros, on the
other hand, are of special interest and remain in the test set since they allow us to test if a
function in the CAS obeys the actual mathematical definition in the DLMF. If the case analyzer
(see Figure 5.2) is unable to detect a relation, i.e., split an expression on <, ≤, ≥, >, =, or
≠, the line in the dataset is also skipped because the evaluation approach relies on relations
to test. After splitting multi-equations (e.g., ±, ∓, a = b = c), filtering out all non-semantic
expressions, non-semantic macro definitions, ellipses, approximations, and asymptotics, we end
up with 6,623 test cases in total from the entire DLMF.


After generating the test case with all constraints, we translate the expression to the CAS
representation. Every successfully translated test case is then symbolically verified, i.e., the
CAS tries to simplify the difference of an equation to zero. Non-equation relations simplify
to Booleans. Non-simplified expressions are verified numerically for manually defined test
values, i.e., we calculate actual numeric values for both sides of an equation and check their
equivalence.


5.1.3.1 Symbolic Evaluation 


The symbolic evaluation was performed for Maple as described in the following (taken from [2]). 
Originally, we used the standalone Maple simplify function directly to symbolically simplify 
translated formulae. See [26, 28, 148, 190] for other examples in the literature where Maple and 
other CAS simplification procedures have been used. Symbolic simplification 
is performed either on the difference or the division of the left-hand sides and the right-hand 
sides of extracted formulae. Thus, the expected outcome should be either 0 or 1, respectively. 
Note that other outcomes, such as other numerical outcomes, are particularly interesting, since 
these may be an indication of errors in the formulae. 


In Maple, symbolic simplifications are made using internally stored relations to other functions. 
If a simplification is available, then in practice it often has to be performed over multiple defined 
relevant relations. Often, this process fails and Maple is unable to simplify the said expression. 
We have adopted some techniques which assist Maple in this process. For example, forcing 
an expression to be converted into another specific representation, in a pre-processing step, 
could potentially improve the odds that Maple is able to recognize a possible simplification. 
By trial-and-error, we discovered (and implemented) the following pre-processing steps which 
significantly improve the simplification process: 


• conversion to exponential representation; 
• conversion to hypergeometric representation; 
• expansion of expressions (for example $(x+y)^2$); and 
• combined expansion and conversion processes. 


9 Note that we filter out ellipsis (e.g., \cdots) but not single dots (e.g., \cdot). 

Figure 5.3: The ten numeric test values in the complex plane for general variables. The dashed 
line represents the unit circle |z| = 1. On the right, we show the set of values for special 
variables and the general global constraints. There, i refers to a generic variable and not 
to the imaginary unit. 


In comparison to the original approach described in [2], we now use the newer version Maple 2020. 
Another feature we added to LaCASt is the support of packages in Maple. Some functions 
are only available in modules (packages) that must be preloaded, such as QPochhammer in 
the package QDifferenceEquations¹⁰. The general simplify method in Maple does not 
cover q-hypergeometric functions. Hence, whenever LaCASt loads functions from the q-hypergeometric 
package, the better performing QSimplify method is used. With the WED and 
the new support for Mathematica in LaCASt, we perform the symbolic and numeric tests for 
Mathematica as well. The symbolic evaluation in Mathematica relies on the full simplification¹¹. 
For Maple and Mathematica, we defined the global assumptions $x, y \in \mathbb{R}$ and $k, n, m \in \mathbb{N}$. 
Constraints of test cases are added to their assumptions to support simplification. Adding 
more global assumptions for symbolic computation generally harms the performance since 
CAS internally use assumptions for simplifications. It turned out that by adding more custom 
assumptions, the number of successfully simplified expressions decreases. 


5.1.3.2 Numerical Evaluation 


Defining an accurate test set of values to analyze an equivalence can be an arbitrarily complex 
process. It would make sense that every expression is tested on specific values according to the 
functions it contains. However, this laborious process is not suitable for evaluating the entire 
DML and CAS. It makes more sense to develop a general set of test values that (i) generally 
covers interesting domains and (ii) avoids singularities, branch cuts, and similar problematic 
regions. Considering these two attributes, we came up with the ten test points illustrated in 
Figure 5.3. The set contains four complex values on the unit circle and six points on the real axis. The 
test values cover the general area of interest (complex values in all four quadrants, negative 
and positive real values) and avoid the typical singularities at $\{0, \pm 1, \pm i\}$. In addition, several 
variables are tied to specific values for entire sections. Hence, we applied additional global 
constraints to the test cases. 


10 https://jp.maplesoft.com/support/help/Maple/view.aspx?path=QDifferenceEquations/QPochhammer [accessed 2021-05-01] 
11 https://reference.wolfram.com/language/ref/FullSimplify.html [accessed 2021-05-01] 

The numeric evaluation engine heavily relies on the performance of extracting free variables 
from an expression. Maple does not provide a function to extract free variables from an 
expression. Hence, we first implemented a custom method. Variables are extracted by identifying 
all names [36]¹² in an expression. This also extracts constants, which must be deleted 
from the list. Unfortunately, inbuilt functions in CAS, if available, and our custom 
implementation for Maple are not very reliable. Mathematica has the undocumented function 
Reduce`FreeVariables for this purpose. However, both systems, the custom solution in 
Maple and the inbuilt Mathematica function, have problems distinguishing free variables of 
entire expressions from the bound variables in MEOMs, e.g., integration and continuous 
variables. Mathematica sometimes does not extract a variable but returns the unevaluated input 
instead. We regularly faced this issue for integrals. However, we discovered one example 
without integrals. For EulerE[n,0] from [98, (24.4.26)], we expected to extract {n} as the set of 
free variables but instead received a set of the unevaluated expression itself {EulerE[n,0]}¹³. 
Since the extended version of LaCASt handles operators, including bound variables of MEOMs, 
we drop the use of internal methods in CAS and extend LaCASt to extract identifiers from an 
expression. During a translation process, LaCASt tags every single identifier as a variable, as 
long as it is not an element of a MEOM. This simple approach proves to be very efficient 
since it is implemented alongside the translation process itself and is already more powerful 
than the existing inbuilt CAS solutions. We defined subscripts of identifiers as a part of 
the identifier, e.g., $z_1$ and $z_2$ are extracted as variables from $z_1 + z_2$, rather than $z$. 
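
The identifier tagging described above can be illustrated with a minimal sketch. It is not the LaCASt implementation: the tuple-based expression tree, the node tags, and the function name free_variables are hypothetical, and the removal of constants is not modeled. The sketch only shows the idea of skipping identifiers that are bound by an enclosing operator while treating a subscripted name as one identifier.

```python
def free_variables(expr, bound=frozenset()):
    """Collect identifiers that are not bound by an enclosing operator.

    Hypothetical node shapes (an assumption, not LaCASt's data model):
      ("sym", name)                      identifier; a subscript is part of the name
      ("num", value)                     numeric literal
      ("int", var, lower, upper, body)   integral that binds `var` inside `body`
      (tag, child, ...)                  any other operation with child nodes
    """
    kind = expr[0]
    if kind == "sym":
        return set() if expr[1] in bound else {expr[1]}
    if kind == "num":
        return set()
    if kind == "int":
        _, var, lower, upper, body = expr
        # The limits lie outside the scope of the bound variable.
        return (free_variables(lower, bound)
                | free_variables(upper, bound)
                | free_variables(body, bound | {var}))
    # Generic operation: union over all children.
    result = set()
    for child in expr[1:]:
        result |= free_variables(child, bound)
    return result


# z_1 + z_2: both subscripted identifiers are reported, not a bare z.
sum_expr = ("add", ("sym", "z_1"), ("sym", "z_2"))
# Integral of t * nu over t from 0 to x: t is bound, x and nu are free.
integral = ("int", "t", ("num", 0), ("sym", "x"),
            ("mul", ("sym", "t"), ("sym", "nu")))

print(sorted(free_variables(sum_expr)))
print(sorted(free_variables(integral)))
```

Because the bound set is passed down during the traversal, the extraction can run in the same pass as a translation, which matches the motivation given above for doing it alongside the translation process.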


The general pipeline for a numeric evaluation works as follows. First, we replace all substitutions 
and extract the variables from the left- and right-hand sides of the test expression via LaCASt. 
For the previously mentioned example of the Struve function [98, (11.5.2)], LaCASt identifies two 
variables in the expression, $\nu$ and $z$. According to the values in Figure 5.3, $\nu$ and $z$ are set to the 
general ten values. A numeric test contains every combination of test values for all variables. 
Hence, we generate 100 test calculations for [98, (11.5.2)]. Afterward, we filter the test values 
that violate the attached constraints. In the case of the Struve function, we end up with 25 test 
cases (see also Table E.2 in Appendix E available in the electronic supplementary material). 
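
A minimal sketch of this generation step follows. The ten values are stand-ins chosen to match the description of Figure 5.3 (four points on the unit circle, one per quadrant, and six real points that avoid 0 and ±1); the concrete numbers, the names TEST_VALUES and generate_test_points, and the representation of constraints as Python predicates are assumptions for illustration only.

```python
import cmath
import itertools
import math

# Assumed stand-ins for the ten points of Figure 5.3.
TEST_VALUES = [
    cmath.exp(1j * math.pi / 6),
    cmath.exp(2j * math.pi / 3),
    cmath.exp(-1j * math.pi / 3),
    cmath.exp(-5j * math.pi / 6),
    -2.0, -1.5, -0.5, 0.5, 1.5, 2.0,
]


def generate_test_points(variables, constraints):
    """Return every assignment of test values that satisfies all constraints."""
    points = []
    for combo in itertools.product(TEST_VALUES, repeat=len(variables)):
        assignment = dict(zip(variables, combo))
        if all(check(assignment) for check in constraints):
            points.append(assignment)
    return points


# Two variables with ten values each give 100 combinations.
all_points = generate_test_points(["nu", "z"], [])
# Only the directly attached constraint Re(z) > 0 is applied here.
valid_points = generate_test_points(
    ["nu", "z"], [lambda a: complex(a["z"]).real > 0]
)
print(len(all_points), len(valid_points))
```

With these assumed values, the attached constraint alone leaves 50 of the 100 combinations. The 25 cases reported above additionally reflect the global constraints, which this sketch does not model.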


In addition, we apply a limit of 300 calculations for each test case and abort a computation after 
30 seconds due to computational limitations. If the test case generates more than 300 test values, 
only the first 300 are used. Finally, we calculate the result for every remaining test value, i.e., 
we replace every variable by its value and calculate the result. The replacement is done by 
Mathematica’s ReplaceAll method because the more appropriate method With, for unknown 
reasons, does not always replace all variables by their values. We wrap test expressions in 
Normal for numeric evaluations to avoid conditional expressions, which may cause incorrect 
calculations (see Section 5.1.4.1 for a more detailed discussion of conditional outputs). After 
replacing variables by their values, we trigger numeric computation. If the absolute value of 
the result is below the defined threshold of 0.001 or the result is true (in the case of inequalities), the test 
calculation is considered successful. A numeric test case is considered successful if and 
only if every test calculation was successful. If a numeric test case fails, we store the information 
on which values it failed and how many of these were successful. 
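
The verdict logic of this paragraph can be sketched as follows. The sketch evaluates Python callables instead of CAS expressions, omits the 30-second timeout, and uses hypothetical names (run_numeric_test, THRESHOLD, MAX_CALCULATIONS); only the threshold of 0.001, the cap of 300 calculations, and the all-or-nothing success rule are taken from the text.

```python
import cmath

THRESHOLD = 0.001        # success threshold stated in the text
MAX_CALCULATIONS = 300   # cap on calculations per test case stated in the text


def run_numeric_test(lhs, rhs, points):
    """Evaluate lhs - rhs on at most MAX_CALCULATIONS points and summarize."""
    tested = points[:MAX_CALCULATIONS]
    failed = []
    for z in tested:
        try:
            ok = abs(lhs(z) - rhs(z)) < THRESHOLD
        except (ZeroDivisionError, OverflowError, ValueError):
            ok = False  # an evaluation error counts as a failed calculation
        if not ok:
            failed.append(z)
    return {
        "success": not failed,            # successful only if nothing failed
        "failed_values": failed,          # which values the test failed on
        "passed": len(tested) - len(failed),
    }


points = [-2.0, -0.5, 0.5, 2.0, cmath.exp(1j * cmath.pi / 6)]

# sin^2(z) + cos^2(z) = 1 holds at every point.
pythagoras = run_numeric_test(
    lambda z: cmath.sin(z) ** 2 + cmath.cos(z) ** 2, lambda z: 1, points
)
# sqrt(z^2) = z fails on the negative real values (principal square root).
root = run_numeric_test(lambda z: cmath.sqrt(z * z), lambda z: z, points)

print(pythagoras["success"], root["success"], root["failed_values"])
```

The second call is a partially failed test case in the sense used in this section: some calculations succeed, and the stored failing values point directly at the region where the principal branch matters.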


12 A name in Maple is a sequence of one or more characters that uniquely identifies a command, file, variable, or 
other entity. 
13 The bug was reported to and confirmed by Wolfram Research Version 12.0. 

5.1.4 Results 


The translations to Maple and Mathematica, the symbolic results, the numeric computations, 
and an overview PDF of the bugs reported to Mathematica are available online¹⁴. In the 
following, we mainly focus on Mathematica because of page limitations and because Maple has 
been investigated more closely by [2]. The results for Maple are also available online. 
Compared to the baseline (~ 31%), our improvements doubled the number of translations (~ 62%) 
for Maple and reach ~ 71% for Mathematica. The majority of expressions that cannot be 
translated contain macros that have no adequate translation pattern to the CAS, such as the 
macros for interval Weierstrass lattice roots [98, §23.3(i)] and the multivariate hypergeometric 
function [98, (19.16.9)]. Other errors (6% for Maple and Mathematica) occur for several reasons. 
For example, out of the 418 errors in translations to Mathematica, 130 caused an error because 
the MEOM of an operator could not be extracted, 86 contained prime notations that do not 
refer to differentiations, 92 failed because of the underlying LaTeX parser [402], and in 46 cases, 
the arguments of a DLMF macro could not be extracted. 


Out of 4,713 translated expressions, 1,235 (26.2%) were successfully simplified by Mathematica 
(1,084 of 4,114 or 26.3% in Maple). For Mathematica, we also count results that are equal to 0 
under certain conditions as successful (called ConditionalExpression). We identified 65 of 
these conditional results: 15 of the conditions are equal to constraints that were provided in the 
surrounding text but not in the info box of the DLMF equation; 30 were produced due to branch 
cut issues (see Section 5.1.4.1); and 20 were the same as attached in the DLMF but reformulated, 
e.g., $z \in \mathbb{C} \setminus (1, \infty)$ from [98, (25.12.2)] was reformulated to $\Im z \neq 0 \lor \Re z < 1$. The remaining 
translated but not symbolically verified expressions were numerically evaluated for the test 
values in Figure 5.3. Of the 3,474 cases, 784 (22.6%) were successfully verified numerically by 
Mathematica (698 of 2,618 or 26.7% by Maple¹⁵). For 1,784, the numeric evaluation failed. In the 
evaluation process, 655 computations timed out and 180 failed due to errors in Mathematica. Of 
the 1,784 failed cases, 691 failed partially, i.e., there was at least one successful calculation among 
the tested values. For 1,091, all test values failed. Appendix E, available in the electronic 
supplementary material, provides Table E.2 with the results for three sample test cases. The 
first case is a false positive evaluation because of a wrong translation. The second case is valid, 
but the numeric evaluation failed due to a bug in Mathematica (see next subsection). The last 
example is valid and was verified numerically but was too complex for symbolic verification. 


5.1.4.1 Error Analysis 


The performance of the numeric tests strongly depends on correctly attached and utilized 
information. The example [98, (1.4.8)] from the DLMF 


$$\frac{d^2 f}{dx^2} = \frac{d}{dx}\left(\frac{df}{dx}\right) \tag{5.2}$$


illustrates the difficulty of the task on a relatively easy case¹⁶. Here, the argument of f was 
not explicitly given, such as in f(x). Hence, LaCASt translated f as a variable. Unfortunately, 


14 https://lacast.wmflabs.org/ [accessed 2021-10-01] 

15 Due to computational issues, 120 cases had to be skipped manually. 292 cases resulted in an error during 
symbolic verification and, therefore, were also skipped for numeric evaluations. Considering these skipped cases as 
failures decreases the numerically verified cases to 23% in Maple. 

16 This is the first example in Table E.2. 

this resulted in a false verification symbolically and numerically. This type of error mostly 
appears in the first three chapters of the DLMF because they use generic functions frequently. 
We hoped to skip such cases by filtering expressions without semantic macros. Unfortunately, 
this derivative notation uses the semantic macro deriv. In the future, we will filter expressions that 
contain semantic macros that are not linked to a special function or orthogonal polynomial. 
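
The false verification can be made explicit with a short worked step. It assumes, as our reading of the translation described above, that f is treated as a symbol that does not depend on the differentiation variable x. Both sides of (5.2) then collapse to zero independently of each other:

```latex
\frac{\mathrm{d}^2 f}{\mathrm{d}x^2} = 0
\qquad\text{and}\qquad
\frac{\mathrm{d}}{\mathrm{d}x}\left(\frac{\mathrm{d}f}{\mathrm{d}x}\right)
  = \frac{\mathrm{d}}{\mathrm{d}x}\,(0) = 0 .
```

The difference of the two sides is therefore trivially zero for every test value, so the symbolic and the numeric check both pass without testing the actual statement about a function f(x).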


As an attempt to investigate the reliability of the numeric test pipeline, we can run numeric 
evaluations on symbolically verified test cases. Since Mathematica already approved a 
translation symbolically, the numeric test should be successful if the pipeline is reliable. Of the 
1,235 symbolically successful tests, only 94 (7.6%) failed numerically. None of the failed test 
cases failed entirely, i.e., for every test case, at least one test value was verified. Manually 
investigating the failed cases reveals 74 cases that failed due to an Indeterminate response 
from Mathematica and 5 that returned infinity, which clearly indicates that the tested numeric 
values were invalid, e.g., due to testing on singularities. Of the remaining 15 cases, two were 
identical: [98, (15.9.2)] and [98, (18.5.9)]. This reduces the remaining failed cases to 14. We 
evaluated invalid values for 12 of these because the constraints for the values were given in the 
surrounding text but not in the info boxes. The remaining 2 cases revealed a bug in Mathematica 
regarding conditional outputs (see below). The results indicate that the numeric test pipeline is 
reliable, at least for relatively simple cases that were previously symbolically verified. The main 
reasons for the high number of failed numerical cases in the entire DLMF (1,784) are 
missing constraints in the i-boxes and branch cut issues (see Section 5.1.4.1), i.e., we evaluated 
expressions on invalid values. 


Bug reports Mathematica has trouble with certain integrals, which, by default, generate 
conditional outputs if applicable. With the method Normal, we can suppress conditional outputs. 
However, it only hides the condition rather than evaluating the expression to a non-conditional 
output. For example, integral expressions in [98, (10.9.1)] are automatically evaluated to the 
Bessel function $J_0(|z|)$ for the condition¹⁷ $z \in \mathbb{R}$ rather than $J_0(z)$ for all $z \in \mathbb{C}$. Setting the 
GenerateConditions¹⁸ option to None does not change the output. Normal only hides $z \in \mathbb{R}$ 
but still returns $J_0(|z|)$. To fix this issue, for example in (10.9.1) and (10.9.4), we are forced to 
set GenerateConditions to False. 


Setting GenerateConditions to False, on the other hand, reveals severe errors in several other 
cases. Consider $\int_z^\infty t^{-1} e^{-t}\,dt$ [98, (8.4.4)], which gets evaluated to $\Gamma(0, z)$ but only under the condition 
$\Re z > 0 \land \Im z = 0$. With GenerateConditions set to False, the integral incorrectly evaluates 
to $\Gamma(0, z) + \ln(z)$. This happened with the 2 cases mentioned above. With the same setting, 
the difference of the left- and right-hand sides of [98, (10.43.8)] is evaluated to 0.398942 for 
$x, \nu = 1.5$. If we evaluate the same expression on $x, \nu = 3$, the result is Indeterminate due 
to infinity. For this issue, one may use NIntegrate rather than Integrate to compute the 
integral. However, evaluating via NIntegrate decreases the number of successful numeric 
evaluations in general. We have revealed errors with conditional outputs in (8.4.4), (10.22.39), 
(10.43.8-10), and (11.5.2) (in [98]). In addition, we identified one critical error in Mathematica. 
For [98, (18.17.47)], WED (Mathematica’s kernel) ran into a segmentation fault (core dumped) 
for n > 1. The kernel of the full version of Mathematica gracefully died without returning an 
output¹⁹. 


17 $J_0(x)$ with $x \in \mathbb{R}$ is even. Hence, $J_0(|z|)$ is correct under the given condition. 
18 https://reference.wolfram.com/language/ref/GenerateConditions.html [accessed 2021-05-01] 
19 All errors were reported to and confirmed by Wolfram Research. 

Besides Mathematica, we also identified several issues in the DLMF. None of the newly identified 
issues was as critical as the sign error reported in the previous project [2]; they generally 
refer to missing or wrongly attached semantic information. With the generated results, we can 
effectively fix these errors and further semantically enhance the DLMF. For example, some 
definitions are not marked as such, e.g., $Q(z) = \int_0^\infty e^{-zt} q(t)\,dt$ [98, (2.4.2)]. In [98, (10.24.4)], 
$\nu$ must be a real value but was linked to a complex parameter and x should be positive real. An 
entire group of cases [98, (10.19.10-11)] also revealed the incorrect use of semantic macros. In 
these formulae, $P_k(a)$ and $Q_k(a)$ are defined but had been incorrectly marked up as Legendre 
functions going all the way back to DLMF Version 1.0.0 (May 7, 2010). In some cases, equations 
are mistakenly marked as definitions, e.g., [98, (9.10.10)] and [98, (9.13.1)] are annotated as 
local definitions of n. We also identified an error in LaCASt, which incorrectly translated the 
exponential integrals $E_1(z)$, Ei(x), and Ein(z) (defined in [98, §6.2(i)]). A more explanatory 
overview of discovered, reported, and fixed issues in the DLMF, Mathematica, and Maple is 
provided in Appendix D available in the electronic supplementary material. 


Branch cut issues Problems that we regularly faced during evaluation are related to 
multi-valued functions. Multi-valued functions map values from a domain to multiple values in 
a codomain and frequently appear in the complex analysis of elementary and special functions. 
Prominent examples are the inverse trigonometric functions, the complex logarithm, or the 
square root. A proper mathematical description of multi-valued functions requires the 
complex analysis of Riemann surfaces. Riemann surfaces are one-dimensional complex manifolds 
associated with a multi-valued function. One usually multiplies the complex domain into a 
many-layered covering space. The properties of multi-valued functions on the complex 
plane may no longer hold for their counterpart functions in CAS, e.g., for $(1/z)^w$ and $1/(z^w)$ 
with $z, w \in \mathbb{C}$ and $z \neq 0$. For example, consider $z, w \in \mathbb{C}$ such that $z \neq 0$. Then 
mathematically, $(1/z)^w$ always equals $1/(z^w)$ (when defined) for all points on the Riemann surface with 
fixed w. However, this should certainly not be assumed to be true in CAS, unless very specific 
assumptions are adopted (e.g., $w \in \mathbb{Z}$, $z > 0$). For all modern CAS²⁰, this equation is not true. 
Try, for instance, $w = 1/2$. Then $(1/z)^{1/2} - 1/z^{1/2} \neq 0$ in CAS, and the same holds for w being any other 
rational non-integer number. 
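
The effect can be reproduced outside a CAS with principal-branch arithmetic. The following minimal sketch uses Python's cmath module as a stand-in for a CAS; the function names are hypothetical. Real floats are passed deliberately so that the negative inputs sit on the branch cut with an imaginary part of +0; a complex input with a signed-zero imaginary part may land on the other side of the cut and hide the effect.

```python
import cmath


def principal_power(base, exponent):
    """base**exponent on the principal branch of the logarithm."""
    return cmath.exp(exponent * cmath.log(base))


def reciprocal_power_gap(z, w=0.5):
    """Return (1/z)**w - 1/(z**w), both evaluated on the principal branch."""
    return principal_power(1 / z, w) - 1 / principal_power(z, w)


# On the negative real axis the two sides pick different square roots.
gap_on_cut = reciprocal_power_gap(-4.0)
# Away from the cut the identity holds up to rounding.
gap_off_cut = reciprocal_power_gap(1 + 2j)

print(gap_on_cut, gap_off_cut)
```

For z = -4, the first term is approximately 0.5i and the second approximately -0.5i, so the difference is approximately i rather than 0. This matches the observation above that the equation cannot be assumed in systems that evaluate on principal branches.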


In order to compute multi-valued functions, CAS choose branch cuts for these functions so that 
they may evaluate them on their principal branches. Branch cuts may be positioned differently 
among CAS [84], e.g., $\operatorname{arccot}(-\frac{1}{2}) \approx 2.03$ in Maple but $\approx -1.11$ in Mathematica. This is 
certainly not an error and is usually well documented for specific CAS [108, 171]. However, 
there is no central database that summarizes branch cuts in different CAS or DML. The DLMF 
also explains and defines its branch cuts carefully but does not carry the information 
within the info boxes of expressions. Due to this complexity, it is rather easy to lose track of 
branch cut positioning and evaluate expressions on incorrect values. For example, consider 
the equation [98, (12.7.10)]. A path of $z(\phi) = e^{i\phi}$ with $\phi \in [0, 2\pi]$ would pass three different 
branch cuts. An accurate evaluation of the values of $z(\phi)$ in CAS requires calculations on the 
three branches using analytic continuation. LaCASt and our evaluation frequently fall into the 
same trap by evaluating values that are no longer on the principal branch used by CAS. To 
solve this issue, we need to utilize branch cuts not only for every function but also for every 
equation in the DLMF [10]. The positions of branch cuts are exclusively provided in the text 


20 The authors are not aware of any example of a CAS which treats multi-valued functions without adopting 
principal branches. 

but not in the i-boxes. Adding the information to each equation in the DLMF would be a 
laborious process because a branch cut position may change according to the used values (see 
the example [98, (12.7.10)] from above). Our result data, however, would provide beneficial 
information to update, extend, and maintain the DLMF, e.g., by adding the positions of the 
branch cuts for every function. An extended discussion about branch cut issues is provided in 
Appendix A, available in the electronic supplementary material. 
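
The two values quoted above for the arccotangent (about 2.03 and about -1.11) are reproduced at x = -1/2 by two common textbook definitions of the real arccotangent. The sketch below only illustrates how such conventions diverge; the function names are hypothetical, and which system follows which convention should be taken from the respective documentation.

```python
import math


def arccot_via_reciprocal(x):
    """arccot(x) = arctan(1/x): values in (-pi/2, pi/2), jump at x = 0."""
    return math.atan(1 / x)


def arccot_via_complement(x):
    """arccot(x) = pi/2 - arctan(x): values in (0, pi), continuous on the reals."""
    return math.pi / 2 - math.atan(x)


x = -0.5
reciprocal_value = arccot_via_reciprocal(x)
complement_value = arccot_via_complement(x)
print(round(reciprocal_value, 2), round(complement_value, 2))
```

For positive arguments both definitions agree, while for negative arguments they differ by exactly π. A numeric comparison across systems therefore fails on half of the real test values even though each system is internally consistent.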


5.1.5 Conclude Quantitative Evaluations on the DLMF 


We have presented a novel approach to verify the theoretical digital mathematical library DLMF 
with the power of two major general-purpose computer algebra systems, Maple and Mathematica. 
With LaCASt, we transformed the semantically enhanced LaTeX expressions from the DLMF 
to each CAS. Afterward, we symbolically and numerically evaluated the DLMF expressions in 
each CAS. Our results are auspicious and provide useful information to maintain and extend 
the DLMF efficiently. We further identified several errors in Mathematica, Maple [2], the DLMF, 
and the transformation tool LaCASt, demonstrating the benefit of the presented verification approach. 
Further, we provide open access to all results, including translations and evaluations²¹. 


The presented results show a promising step towards an answer for our initial research question. 
By translating an equation from a DML to a CAS, automatic verifications of that equation in 
the CAS allow us to detect issues in either the DML source or the CAS implementation. Each 
analyzed failed verification successively improves the DML or the CAS. Further, analyzing a 
large number of equations from the DML may be used to finally verify a CAS. In addition, 
the approach can be extended to cover other DML and CAS by exploiting different translation 
approaches, e.g., via MathML [18] or OpenMath [152]. 


Nonetheless, the analysis of the results, especially for an entire DML, is cumbersome. Minor 
missing semantic information, e.g., a missing constraint or not respected branch cut positions, 
leads to a relatively large number of false positives, i.e., unverified expressions correct in the DML 
and the CAS. This makes a generalization of the approach challenging because all semantics of 
an equation must be taken into account for a trustworthy evaluation. Furthermore, evaluating 
equations on a small number of discrete values will never provide sufficient confidence to verify 
a formula, which leads to an unpredictable number of true negatives, i.e., erroneous equations 
that pass all tests. 


After all, we conclude that the approach provides valuable information to complement, improve, 
and maintain the DLMF, Maple, and Mathematica. A trustworthy verification, on the other 
hand, might be out of reach. 


5.1.5.1 Future Work 


The resulting dataset provides valuable information about the differences between CAS and the 
DLMF. These differences had not been largely studied in the past and are worthy of analysis. 
Especially a comprehensive and machine-readable list of branch cut positioning in different 
systems is a desired goal [84]. Hence, we will continue to work closely together with the 
editors of the DLMF to improve further and expand the available information on the DLMF. 
Finally, the numeric evaluation approach would benefit from test values dependent on the actual 
functions involved. For example, the current layout of the test values was designed to avoid 


21 https://lacast.wmflabs.org/ [accessed 2021-10-01] 

problematic regions, such as branch cuts. However, for identifying differences in the DLMF 
and CAS, especially for analyzing the positioning of branch cuts, an automatic evaluation of 
these particular values would be very beneficial and can be used to collect a comprehensive, 
inter-system library of branch cuts. Therefore, we will further study the possibility of linking 
semantic macros with numeric regions of interest. 


Finally, we used LaCASt to perform translations solely on semantic LaTeX expressions. Real-world 
mathematics, however, is not available in this semantically enriched format. In the previous 
chapter, we already developed and discussed a context-sensitive extension for LaCASt. This 
enables LaCASt to translate not only semantic LaTeX formulae from the DLMF but, considering 
an informative textual context, also general mathematical expressions to multiple CAS. In the 
following section, we will evaluate this new extension of LaCASt on Wikipedia articles. 


5.2 Evaluations on Wikipedia 


In the following, motivated by our goal outlined in Chapter 4 (improving Wikipedia 
articles), we use Wikipedia as our test dataset to evaluate our context-sensitive extension of 
LaCASt. More specifically, we considered every English Wikipedia article that references the 
DLMF via the {{dlmf}} template²². This should limit the domain to OPSF problems that we are 
currently examining. The English Wikipedia contains 104 such pages, of which only one page 
did not contain any formula (Spheroidal wave function)²³. For the entire dataset (the remaining 
103 Wikipedia pages), we detected 6,337 formulae in total (including potentially erroneous math). 


So far, one of our initial three issues from Section 4.2.3 still remains unsolved: how can we 
determine if a translation was appropriate and complete? We called a translation appropriate 
if the intended meaning of a presentational expression $e \in \mathcal{L}_P$ is the same as that of the translated 
expression $t(e, X) \in \mathcal{L}_C$. However, how can we know the intended semantic meaning of a 
presentational expression $e \in \mathcal{L}_P$? In natural languages, the BLEU score [282] is widely used to 
judge the quality of a translation. The effectiveness of the BLEU score, however, is questionable 
when it comes to math translations due to the complexity and high interconnectedness of 
mathematical formulae. Consider a translation of the arccotangent function arccot(x) that was 
performed to arctan(1/(x)) in Maple. This translation is correct and even preferred in certain 
situations to avoid issues with so-called branch cuts (see [13, Section 3.2]). Previously, we 
developed a new approach that relies on automatic verification checks with CAS [2, 11] to 
verify a translation. This approach is very powerful for large datasets. However, it requires a 
large and precise amount of semantic data about the involved formulae, including constraints, 
domains, the position of branch cuts, and other information to reach high accuracy. In turn, we 
perform this automatic verification on all 103 Wikipedia pages but additionally created 
a benchmark dataset with 95 entries for qualitative analysis. To avoid issues like those with the BLEU 
score, we manually evaluated each translation of the 95 test cases. 


22 Templates in Wikitext are placeholders for repetitive information which get resolved by Wikitext parsers. The 
DLMF-template, for example, adds the external reference for the DLMF to the article. 

23 Retrieved from https://en.wikipedia.org/wiki/Special:WhatLinksHere by searching for 
Template:Dlmf [accessed 2021-01-01] 

5.2.1 Symbolic and Numeric Testing 


The automatic verification approach makes the assumption that a correct equation in the 
domain must remain valid in the codomain after a translation. If the equation is incorrect after 
a translation, we conclude a translation error. As we have discussed in the previous Section 5.1, 
we examined two approaches to verify an equation in a CAS. The first approach tries to 
symbolically simplify the difference of the left- and right-hand sides of an equation to zero. If 
the simplification returned zero, the equation was symbolically verified by the CAS. Symbolic 
simplifications of CAS, however, are rather limited and may even fail on simple equations. 
The second approach overcomes this issue by numerically calculating the difference between 
the left- and right-hand sides of an equation on specific numeric test values. If the difference 
is zero (or below a given threshold due to machine accuracy) for every test calculation, the 
equivalence of an equation was numerically verified. Clearly, the numeric evaluation approach 
cannot prove equivalence. However, it can prove disparity and therefore detect an error due to 
the translation. 


In the previous Section 5.1, we saw that the translations by LaCASt [13] were so reliable that 
the combination of symbolic and numeric evaluations was able to detect errors in the domain 
library (i.e., the DLMF) and the codomain systems (i.e., the CAS Maple and Mathematica) [2, 
11]. Unfortunately, the number of false positives, i.e., correct equations that were verified 
neither symbolically nor numerically, is relatively high. The main reason is unconsidered semantic 
information, such as constraints for specific variables or the position of branch cuts. 
Unconsidered semantic information causes the system to test equivalence on invalid conditions, such 
as invalid values, and therefore yields inequalities between the left- and right-hand sides of 
an equation even though the source equation and the translation were correct. Nonetheless, 
the symbolic and numeric evaluation approach proves to be very useful also for our translation 
system. It allows us to quantitatively evaluate a large number of expressions in Wikipedia. 
In addition, it enables continuous integration testing for mathematics in Wikipedia article 
revisions. For example, an equation previously verified by the system that fails after a revision 
could indicate a poisoned revision of the article. This automatic plausibility check might be a 
jump start for the ORES system to better maintain the quality of mathematical documents [359]. 
For changes in math equations, ORES could trigger a plausibility check through our translation 
and verification pipeline and adjust the good-faith or damaging score of an edit accordingly. 


5.2.2 Benchmark Testing 


To compensate for the relatively low number of verifiable equations in Wikipedia with the 
symbolic and numeric evaluation approach, we crafted a benchmark test dataset to qualitatively 
evaluate the translations. This benchmark includes a single equation (the formulae must 
contain a top-level equality symbol²⁴, no \text, and no \color macros) randomly picked from 
each Wikipedia article from our dataset. For eight articles, no such equation was detected. 
Hence, the benchmark contains 95 test expressions. For each formula, we marked the extracted 
descriptive terms as irrelevant (0), relevant (1), or highly relevant (2), and manually translated the 
expressions to semantic LaTeX and to Maple and Mathematica. If the formula contained a function 
for which no appropriate semantic macro exists, the semantic LaTeX equals the generic (original) 
LaTeX of this function. In 18 cases, even the human annotator was unable to appropriately 


24 This excludes equality symbols of deeper levels in the parse tree, e.g., the equality symbols in sums are not 
considered as such. 

Table 5.3: The symbolic and numeric evaluations on all 6,337 expressions from the dataset 
with the number of translated expressions (T), the number of started test evaluations (Started), 
the success rates (Success), and the success rates on the DLMF dataset for comparison (DLMF). 
The DLMF scores refer to the results presented in the previous Section 5.1. 


Symbol Evaluation 
T Started : Success | DLMF 
Maple 4,601 | 1,747 ' .113 .264 


Mathematica 4,678 | 1,692 ' .158 262 


Numeric Evaluation 
T Started ' Success | DLMF 
Maple 4,601 | 1,627 : .181 .433 


Mathematica 4,678 | 1,516 ! .236 429 


translate the expressions to the CAS, which underlines the difficulty of the task. The main
reasons for a manual translation failure were missing information (the necessary information for
an appropriate translation was not given in the article) and elements for which an
appropriate translation was not possible, such as contour integrals, approximations, or indefinite
lists of arguments with dots (e.g., a_1, ..., a_n). Note that the domain of orthogonal polynomials
and special functions is well supported by many general-purpose CAS, like Maple
and Mathematica. Hence, in other domains, such as group, number, or tensor field theory,
we can expect a significant drop in human-translatable expressions²⁵. Since Mathematica is
able to import LaTeX expressions, we use this import function as a baseline for our translations
to Mathematica. We provide full access to the benchmark via our demo website and added an
overview to Appendix F.4, available in the electronic supplementary material.


5.2.3 Results 


First, we evaluated the 6,337 detected formulae with our automatic evaluation via Maple and
Mathematica. Table 5.3 shows an overview of this evaluation. With our translation pipeline, we
were able to translate 72.6% of the mathematical expressions into Maple and 73.8% into Mathematica
syntax. Of these translations, around 40% were symbolically and numerically evaluated
(the rest were filtered out due to missing equation symbols, illegal characters, etc.). We were able to
symbolically verify 11% (Maple) and 15% (Mathematica), and numerically verify 18% (Maple)
and 24% (Mathematica). In comparison, the same tests on the manually annotated semantic
dataset of DLMF equations [403] reached a success rate of 26% for symbolic and 43% for numeric
evaluations [11] (see the previous Section 5.1). Since the DLMF is a manually annotated
semantic dataset that provides exclusive access to constraints, substitutions, and other relevant
information, we achieve very promising results with our context-sensitive pipeline. To test a
theoretical continuous integration pipeline for the ORES system in Wikipedia articles, we also
analyzed edits in math equations that have been reverted again. The Bessel function article contains


25 Note that there are numerous specialized CAS that would cover the mentioned domains too, such as GAP [177],
PARI/GP [283], or Cadabra [290].

such an edit on the equation

    J_n(x) = \frac{1}{\pi} \int_0^\pi \cos(n\tau - x \sin\tau) \, d\tau.    (5.3)

Here, the edit²⁶ changed J_n(x) to J„WE(x). Our pipeline was able to symbolically and
numerically verify the original expression but failed on the revision. The ORES system could
profit from this result and adjust the score according to the automatic verification via CAS.


5.2.3.1 Descriptive Term Extractions 


Previously, we presumed that our update of the description retrieval approach to MOI would 
yield better results. In order to check the ranking of retrieved facts, we evaluate the descriptive
term extractions and compare the results with our previously reported F1 scores in [330]. We
analyze the performance for different numbers of retrieved descriptions and different depths.
Here, the depth refers to the maximum depth of in-going dependencies in the dependency 
graph to retrieve relevant descriptions. A depth value of zero does not retrieve additional terms 
from the in-going dependencies but only the noun phrases that are directly annotated to the 
formula itself. The results for relevance 1 or higher are given in Table 5.4a and for relevance 2 
in Table 5.4b. Since we need to retrieve a high number of relevant facts to achieve a complete 
translation, we are more interested in retrieving any relevant fact rather than a single but 
precise description. Hence, the performance for relevance 1 is more appropriate for our task. 
For a better comparison with our previous pipeline [330], we also analyze the performance 
only on highly relevant descriptions (relevance 2). As expected, for relevant noun phrases, 
we outperform the reported F1 score (.35). For highly relevant entries only, our updated MOI 
pipeline achieves similar results with an F1 score of .385. 
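
As a worked example of how the scores in Table 5.4 relate to each other, the following sketch recomputes precision and F1 for the row D = 0, N = 3 of Table 5.4a. The counts and the recall value are taken from the table; the function names are only illustrative.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of retrieved descriptions that are relevant."""
    return true_positives / (true_positives + false_positives)


def f1_score(prec: float, rec: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * prec * rec / (prec + rec)


# Row D = 0, N = 3 of Table 5.4a: 136 true positives, 95 false positives, recall .424.
p = precision(136, 95)
f1 = f1_score(p, 0.424)
print(f"precision = {p:.3f}, F1 = {f1:.3f}")  # precision = 0.589, F1 = 0.493
```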


5.2.3.2 Semantification 


Since we split our translation pipeline into two steps, semantification and mapping, we evaluate
the semantification transformations first. To do this, we use our benchmark dataset and perform
tree comparisons of our generated transformed tree t(e, X) and the semantically enhanced
tree using semantic macros. The number of facts we take into account affects the performance.
With too few facts, the transformation might be incomplete, i.e., there are still subtrees in e
that should already be in L_C. Too many facts increase the risk of false positives that yield
wrong transformations. In order to estimate how many facts we need to retrieve to achieve
a complete transformation, we evaluated the comparison on different depths D and limited the
number of facts with the same MOI, i.e., we only consider the top-ranked facts f for an MOI
according to s_MLP(f). In addition, we limit the number of retrieved rules r_f per MC. We
observed that an equal limit of retrieved MC per MOI and r_f per MC performed best. For example,
if we set the limit N to five, we retrieve a maximum of 25 facts (five r_f for each of the five
MC for a single MOI). Typically, the number of retrieved facts f is below this limit because
similar MC yield similar r_f. In addition, we found that considering replacement patterns with
a likelihood of 0% (i.e., the rendered version of this macro never appears in the DLMF) harms
performance drastically. This is because semantic macros without any arguments regularly
match single letters, for example, Γ representing the gamma function with the argument (z)


26 https://en.wikipedia.org/w/index.php?diff=991994767&oldid=991251002&title=Bessel_function&type=revision
[accessed 2021-06-23]

Table 5.4: Performance of description extractions via MLP for low (5.4a) and high (5.4b) relevance.
In all tables, D refers to the depth (following ingoing dependencies) in the dependency
graph, N is the maximum number of facts and r_f for the same MOI, TP are true positives, and
FP are false positives.


(a) Relevance 1 or higher.

Description Extraction
D    N  |  TP    FP   |  Prec   Rec    F1
0    1  |   59    32  |  .648   .184   .286
0    3  |  136    95  |  .589   .424   .493
0    6  |  155   150  |  .508   .483   .495
0   15  |  167   190  |  .468   .520   .493
…
1    3  |  179   602  |  .229   .558   .325
1    6  |  210  1107  |  .159   .654   .256
2    1  |  122   210  |  .367   .379   .373
2    3  |  179   600  |  .230   .556   .325


(b) Relevance 2.

Description Extraction
D    N  |  TP    FP   |  Prec   Rec    F1
0    1  |   41    59  |  .451   .210   .287
0    3  |   82   149  |  .355   .421   .385
0    6  |   90   215  |  .295   .462   .360
0   15  |   95   262  |  .266   .487   .344
…
1    3  |  106   675  |  .139   .544   .217
1    6  |  124  1193  |  .094   .636   .164
2    1  |   56   227  |  .198   .287   .234
2    3  |   88   661  |  .117   .451   .186

being omitted. Hence, we decided to consider only replacement patterns that exist in the DLMF,
i.e., s_DLMF(r_f) > 0.
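
The retrieval limits described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of our pipeline: the Rule type, the function retrieve_rules, the likelihood values, and the choice to filter zero-likelihood patterns before applying the limit are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class Rule(NamedTuple):
    pattern: str       # generic LaTeX pattern (illustrative)
    macro: str         # semantic LaTeX replacement (illustrative)
    likelihood: float  # share of DLMF uses in which this rendering appears


def retrieve_rules(
    ranked_mcs: List[str], rules_by_mc: Dict[str, List[Rule]], limit: int
) -> List[Rule]:
    """Collect at most limit * limit rules for a single MOI.

    ranked_mcs:  MC for the MOI, best score first
    rules_by_mc: MC -> replacement rules, best rank first
    """
    selected: List[Rule] = []
    for mc in ranked_mcs[:limit]:
        # Assumption: patterns that never appear in the DLMF are dropped first.
        usable = [rule for rule in rules_by_mc.get(mc, []) if rule.likelihood > 0]
        for rule in usable[:limit]:
            if rule not in selected:  # similar MC often yield the same rule
                selected.append(rule)
    return selected


# Hypothetical rules for the gamma function; the likelihood values are made up.
gamma_rules = [
    Rule(r"\Gamma(z)", r"\EulerGamma@{z}", 0.9),
    Rule(r"\Gamma", r"\EulerGamma", 0.0),
]
rules = retrieve_rules(
    ["gamma function", "Euler gamma function"],
    {"gamma function": gamma_rules, "Euler gamma function": gamma_rules},
    limit=5,
)
```

In this toy setting, both MC yield the same rules, so only a single rule remains after removing duplicates and the zero-likelihood pattern, which mirrors why the number of retrieved facts typically stays below the limit.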


Since certain subtrees ẽ ⊆ e ∈ L_P can already be operator trees, i.e., ẽ ∈ L_C, we calculate
a baseline (base) that does not perform any transformations, i.e., e = t(e, X). The baseline
achieves a success rate of 16%. To estimate the impact of our manually defined set of common
knowledge facts K, we also evaluated the transformations for X = K and achieved a success
rate of 29%, which is already significantly better than the baseline. The full pipeline,
as described above, achieves a success rate of 48%. Table 5.5 compares the performance. The
table shows that depth 1 outperforms depth 0, which seemingly contradicts the F1 scores in
Table 5.4a. This underlines the necessity of the dependency graph. We further observe a drop
in the success rate for larger N. This is attributable to the fact that g_f(e) is not commutative
and large N retrieve too many false positive facts f with high ranks. We reach the best success
rate for depth 1 and N = 6. Increasing the depth further only has a marginal impact because,
at depth 2, most expressions are already single identifiers, which do not provide significant
information for the translation process.
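
Why the order of transformations matters can be illustrated with a small sketch. Plain string replacement stands in for the tree transformations here, and the two rules are illustrative assumptions; the sketch only shows that a highly ranked false positive can block a correct rule that is applied later.

```python
from typing import List, Tuple


def apply_rules(expression: str, rules: List[Tuple[str, str]]) -> str:
    """Apply (pattern, replacement) pairs one after another, in the given order."""
    for pattern, replacement in rules:
        expression = expression.replace(pattern, replacement)
    return expression


with_argument = (r"\Gamma(z)", r"\EulerGamma@{z}")  # pattern that includes the argument
bare_letter = (r"\Gamma", r"\EulerGamma")           # pattern that matches the letter alone

correct = apply_rules(r"\Gamma(z)", [with_argument, bare_letter])
blocked = apply_rules(r"\Gamma(z)", [bare_letter, with_argument])
print(correct)  # \EulerGamma@{z}
print(blocked)  # \EulerGamma(z)
```

In the second order, the argument-free pattern consumes the letter first, so the pattern with the argument no longer matches.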


5.2.3.3 Translations from LaTeX to CAS


Mathematica’s ability to import TeX expressions will serve as a baseline. While Mathematica
does not allow entering a textual context, it does recognize structural information in
the expression. For example, the Jacobi polynomial P_n^{(\alpha,\beta)}(x) is correctly imported as
JacobiP[n,\[Alpha],\[Beta],x] because no other supported function in Mathematica is
linked with this presentation. Table 5.6 compares the performance. The methods base, ck, and
full are the same as in Table 5.5, but now refer to translations to Mathematica rather than
semantic LaTeX. Method full uses the optimal setting as shown in Table 5.5. We consider a

Table 5.5: Performance of semantification from LaTeX to semantic LaTeX. D refers to the depth
(following ingoing dependencies) in the dependency graph, N is the maximum number of facts
and r_f for the same MOI. The method base refers to no transformations, t(e, X) = e, ck
to X = K, and full to the full proposed pipeline. ✓ matches the benchmark entry and
✗ does not match the entry.


Semantic LaTeX Comparison

Method   D    N  |  ✓     ✗
base     -    -  |  .16   .84
ck       -    -  |  .29   .71
full     0    3  |  .36   .64
         0    6  |  .40   .60
         0   15  |  .40   .60

translation a match (✓) if the value returned by Mathematica for the translation equals the value
returned for the benchmark entry. The internal process of Mathematica ensures that the translation is normalized.


We observe that without further improvements, LaCASt already outperforms Mathematica’s
internal import function. Activating the general replacement rules further improved performance.
Our full context-aware pipeline achieves the best results. The relatively high ratio of
invalid translations for full is owed to the fact that semantic macros without an appropriate
translation to Mathematica result in an error during the translation process. The errors ensure
that LaCASt only performs translations for semantic LaTeX if a translation is unambiguous and
possible for the contained functions [13]. Note that we were not able to appropriately translate
18 expressions (indicated by the human performance in Table 5.6) as discussed before.


5.2.4 Error Analysis & Discussion 


In this section, we briefly summarize the main causes of errors in our translation pipeline. A
more extensive analysis can be found in Appendix F.3 (available in the electronic supplementary
material) and on our demo page at https://tpami.wmflabs.org. In the following, we
may refer to specific benchmark entries by their associated ID. Since the benchmark contains
randomly picked formulae from the articles, it also contains entries that might not have been
properly annotated with math templates or math tags in the Wikitext. Four entries in the
benchmark (28, 43, 78, and 85) were wrongly detected by our engine and contained only parts
of the entire formula. In the benchmark, we manually corrected these entries. Aside from the
wrong identification, we identified other failure reasons for a translation to semantic LaTeX or
CAS. In the following, we discuss the main reasons and possible solutions to avoid them, in
order of their impact on translation performance.

Table 5.6: Performance comparison for translating LaTeX to Mathematica. A translation was
successful (ST) if it was syntactically verified by Mathematica (otherwise: FT). ✓ refers to
matches with the benchmark and ✗ to mismatches. The methods are explained in Section 5.2.3.3.


LaTeX Translations to Mathematica
Method        ST         FT        |  ✓           ✗
MM_import     57 (.60)   38 (.40)  |   9 (.09)    48 (.51)
LaCASt_base   55 (.58)   40 (.42)  |  11 (.12)    44 (.46)
LaCASt_ck     62 (.65)   33 (.35)  |  19 (.20)    43 (.45)
LaCASt_full   53 (.56)   42 (.44)  |  26 (.27)    27 (.26)
Theory_def    -          -         | +18 (.19)   -18 (.19)
Theory_ck     -          -         |  +3 (.03)    -3 (.03)
Human         95 (1.0)    0 (.00)  |  77 (.81)    18 (.19)


5.2.4.1 Defining Equations 


Recognizing an equation as a definition would have a great impact on performance. As a test,
we manually annotated every definition in the benchmark by replacing the equal sign = with
the unambiguous notation := and extended LaCASt to recognize such a combination as a definition
of the left-hand side²⁷. This resulted in 18 more correct translations (e.g., 66, 68, and 75) and
increased the performance from .28 to .47. The accuracy for this manual improvement is given
as Theory_def in Table 5.6.
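
How an explicit := annotation can be exploited is sketched below. This toy example is not the extension implemented in the translator: it only handles single-argument functions written as name(var), and the names parse_definition and expand as well as the regular expressions are assumptions made for illustration.

```python
import re

# Matches equations of the form name(var) := body, e.g. f(x) := \sin^2(x).
DEFINITION = re.compile(
    r"^\s*(?P<name>[A-Za-z]+)\((?P<var>[A-Za-z]+)\)\s*:=\s*(?P<body>.+?)\s*$"
)


def parse_definition(equation: str):
    """Return (name, variable, body) for a := definition, otherwise None."""
    match = DEFINITION.match(equation)
    if match is None:
        return None
    return match.group("name"), match.group("var"), match.group("body")


def expand(expression: str, definition) -> str:
    """Replace each call name(arg) in the expression with the defining body."""
    name, var, body = definition
    # The lookbehind avoids matching the name inside longer names or macros.
    call = re.compile(r"(?<![A-Za-z\\])" + re.escape(name) + r"\(([^()]*)\)")

    def substitute(match):
        # Replace the bound variable (as a whole word) with the actual argument.
        argument = match.group(1)
        return "(" + re.sub(r"\b" + re.escape(var) + r"\b", lambda _: argument, body) + ")"

    return call.sub(substitute, expression)


definition = parse_definition(r"f(x) := \sin^2(x)")
expanded = expand("f(y) - 1", definition)
print(expanded)  # (\sin^2(y)) - 1
```

An ordinary equation such as a = b is not treated as a definition by this sketch, which reflects the selective handling that the := annotation makes possible.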


The dependency graph may provide beneficial information towards a definition recognition 
system for equations. However, rather than assuming that every equation symbol indicates a 
definition [214], we propose a more selective approach. Considering one part of an equation 
(including multi-equations) as an extra MOI would establish additional dependencies in the 
dependency graph, such as a connection between x = sn(u, k) and F(x; k) = u. A combination
with recent advances of definition recognition in NLP [111, 134, 183, 370] may then allow us to 
detect x as the defining element. The already established dependency between x and F(x; k) = 
u can finally be used to resolve the substitution. Hence, for future research, we will elaborate 
on the possibility of integrating existing NLP techniques for definition recognition [111, 134] 
into our dependency graph concept. 


5.2.4.2 Missing Information 


Another problem that causes translations to fail is missing facts. For example, the gamma 
function seems to be considered common knowledge in most articles on OPSF because it is 
often not specifically declared by name in the context (e.g., 19 or 31). To test the impact of 
considering the gamma function as common knowledge, we added a rule r_f to K and attached
a low rank to it. The low rank ensures the pattern for the gamma function will be applied 
late in the list of transformations. This indeed improved performance slightly, enabling a 
successful translation of three more benchmark entries (Theory_ck in Table 5.6). This naive 


27 The DLMF does not use this notation; hence, LaCASt was not capable of translating := in the first place.

approach emphasizes the importance of knowing the domain knowledge for specific articles. In
combination with article classifications [320], we could activate different common knowledge
sets depending on the specific domain.


5.2.4.3 Non-Matching Replacement Patterns 


An issue we would face more regularly in domains other than OPSF is non-standard notation.
As previously mentioned, without definition detection, we would not be able to derive
transformation rules if the MOI is not given in a standard notation, such as p(a, b, n, z) for the
Jacobi polynomial. This already happens for slight changes that are not covered by the DLMF.
For six entries, for instance, we were unable to appropriately replace hypergeometric functions
because they used the matrix and array environments in their arguments, while the DLMF (as
shown in Table 4.5) only uses \atop for the same visualization. Consequently, none of our
replacement patterns matched even though we correctly identified the expressions as hypergeometric
functions. A possible solution to this kind of minor representational change might
be to add more possible presentational variants m for a semantic macro m. Previously [14],
we presented a search engine for MOI that allows searching for common notations for a given
textual query. Searching for Jacobi polynomials in arXiv.org shows that different variants of
P_n^{(\alpha,\beta)}(x) are highly related or even used equivalently, such as p, H, or R rather than P. There
were also a couple of other minor issues we identified during the evaluation, such as synonyms
for function names, derivative notations, or non-existent translations for semantic macros. This
is also one of the reasons why our semantic LaTeX test performed better than the translations to
Mathematica. We provide more information on these cases on our demo page.


Implementing the aforementioned improvements would increase the score from .27 (26 out of 95)
to .495 (47 out of 95) for translations from LaTeX to Mathematica. We achieved these results based
on several heuristics, such as the primary identifier rules or the general replacement patterns,
which indicates that we may improve results even further with ML algorithms. However,
the lack of a properly annotated dataset and of appropriate error functions made it difficult to
achieve promising results with ML on mathematical translation tasks in the past [1, 15]. Our
translation pipeline based on LaCASt paves the way towards a baseline that can be used to train
ML models in the future. Hence, we will focus on a hybrid approach of rule-based translations
via LaCASt on the one hand, and ML-based information extraction on the other hand, to further
push the limits of our translation pipeline.
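
The potential score follows directly from the counts in Table 5.6. The short calculation below only restates that arithmetic; the variable names are illustrative.

```python
BENCHMARK_SIZE = 95

full_pipeline = 26          # correct translations of the full pipeline (Table 5.6)
from_definitions = 18       # additional matches with annotated definitions (Theory_def)
from_common_knowledge = 3   # additional matches with the gamma function rule (Theory_ck)

potential = full_pipeline + from_definitions + from_common_knowledge
print(potential, round(potential / BENCHMARK_SIZE, 3))  # 47 0.495
```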


5.2.5 Conclude Qualitative Evaluations on Wikipedia 


We presented LaCASt, the first context-sensitive translation pipeline for mathematical expressions
to the syntax of two major Computer Algebra Systems (CAS), Maple and Mathematica. We
demonstrated that the information we need to translate is given as noun phrases in the textual
context surrounding a mathematical formula and in common knowledge databases that define
notation conventions. We successfully extracted the crucial noun phrases via part-of-speech
tagging. Further, we have shown that CAS can automatically verify the translated expressions
by performing symbolic and numeric computations. In an evaluation with 104 Wikipedia articles
in the domain of orthogonal polynomials and special functions, we verified 358 formulae using
our approach. We identified one malicious edit with this technique, which was reverted by
the community three days later. We have shown that LaCASt correctly translates about 27% of
mathematical formulae compared to 9% with existing approaches and an 81% human baseline.
Further, we demonstrated a potential successful translation rate of 46% if LaCASt can identify
definitions correctly and 49% with a more comprehensive common knowledge database.


Our translation pipeline has several practical applications for a knowledge database like 
Wikipedia, such as improving the readability [17] and user experience [150], enabling entity 
linking for mathematics [320, 17], or allowing for automatic quality checks via CAS [2, 11]. In 
turn, we plan to integrate [401] our evaluation engine into the existing ORES system to classify 
changes in complex mathematical equations as potentially damaging or good faith. In addition, 
the system provides access to different semantic formats of a formula, such as multiple CAS
syntaxes and semantic LaTeX [260]. As shown in the DLMF [260], the semantic encoding of a
formula can improve search results for mathematical expressions significantly. Hence, we also 
plan to add the semantic information from our mathematical dependency graph to Wikipedia’s 
math formulae to improve search results [17]. 


In future work, we aim to mitigate the issues outlined in Section 5.2.4, primarily focusing
our efforts on definition recognition for mathematical equations. Advances in this area
will enable support for translations beyond OPSF. In particular, we plan to analyze the
effectiveness of associating equations with their nearby context classification [111, 134, 183, 
370], assuming a defining equation is usually embedded in a definition context. Apart from 
expanding the support beyond OPSF, we further focus on improving the verification accuracy of 
the symbolic and numeric evaluation pipeline. In contrast to the evaluations on the DLMF, our 
evaluation pipeline currently disregards constraints in Wikipedia. While most constraints in the 
DLMF directly annotate specific equations, Wikipedia contains constraints in the surrounding 
context of the formula. We plan to identify constraints with new pattern matches and distance 
metrics, by assuming that constraints are often short equations (and relations) or set definitions 
and appear shortly after or before the formula they are applied to. While we made math in 
Wikipedia computable, the encyclopedia does not take advantage of this new feature yet. In 
future work, we will develop an AI [401] (as an extension to the existing ORES system) that 
makes use of this novel capability. 


This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/).


Chapter 6
Conclusion and Future Work


This chapter summarizes and concludes the contributions of this thesis in Section 6.1 and Section 6.2,
respectively. Section 6.3 provides an overview of future work projects.


6.1 Summary 


In this thesis, we presented novel approaches to translate presentational mathematical encodings
into computable formats and to evaluate these translations. We focused on LaTeX for
the presentational encodings and Computer Algebra Systems (CAS) syntaxes for computable
formats. Primarily, we focused on translations to the two major general-purpose CAS Maple
and Mathematica.


Every mathematical format serves a specific purpose and encodes different amounts of semantic
information into an expression. A presentational format encodes visual information, while
computable formats need to uniquely link elements with specific definitions (i.e., implementations).
There are numerous mathematical formats and conversion tools available. Many roads
lead to Rome; thus, there are several translation paths from LaTeX to CAS syntaxes available,
including direct translations via CAS import functions (see Table 1.3). The most well-covered
conversion path between mathematical formats is between the standard encodings LaTeX and
MathML. Since content MathML explicitly encodes semantic information and many CAS are
able to import content MathML, the easiest approach for translating LaTeX to CAS was to use
MathML as an intermediate format. Hence, we developed MathMLben, a MathML benchmark,
to evaluate the quality of the translations of several state-of-the-art LaTeX to MathML conversion
tools.


Supplementary Information The online version contains supplementary material available at
https://doi.org/10.1007/978-3-658-40473-4_6.

Our benchmark test revealed that existing LaTeX conversion tools only consider the semantic
information that is explicitly encoded in the given expression, e.g., via visual pattern recognition
approaches. For example, Mathematica concludes P_n^{(\alpha,\beta)}(x) to be the Jacobi polynomial because
there is no other expression with the same pattern available in Mathematica. Only three of the
nine state-of-the-art converters supported content MathML but with insufficient accuracy. The
conversion tool LaTeXML performed best and is able to translate semantically enriched formulae
in semantic LaTeX. Without a manual annotation with semantic macros, however, LaTeXML also
creates wrong and incomplete results. In addition, even though CAS often support MathML
(including content MathML), there is no public mapping between functions in a Content Dictionary
(CD) and functions in the CAS available. Hence, a reliable import of MathML is generally
limited to K-14¹ mathematics.


Prior to this thesis, we developed LaCASt, a translator from semantic LaTeX to the CAS Maple.
LaCASt was the first translator to a CAS syntax that provided additional information about the
translation process and provided alternative translations if a direct mapping was unavailable.
The first version of LaCASt laid the foundation to solve translation issues related to differences
in the definitions of functions, e.g., branch cut positioning. However, LaCASt required manually
crafted semantic LaTeX as it is used in the DLMF. Subsequently, we focused on extending LaCASt to
perform a semantification step from LaTeX to semantic LaTeX based on the information gathered
in the surrounding context of a formula.


The semantification of mathematical expressions, even though related to other MathIR tasks, was 
new due to the information needs for a translation to computable formats. Other tasks in MathIR, 
such as the search for relevant or similar formulae, rarely need to understand the structure of 
mathematical objects in an expression. For a translation to computable formats, a conversion 
tool needs to identify the subexpressions representing a specific formula, determine which 
formula it represents, what parts of the subexpression are variable or fixed (stem), and how the 
formula is declared in the context. Existing approaches to semantically enhance mathematical 
expressions with information from a textual context can be categorized into two groups. The 
first group takes single identifiers (or other single tokens) and attaches information from the 
context to these identifiers. The second group annotates entire mathematical expressions. Both 
approaches, however, ignore informative and crucial subexpressions. 


As a first approach for a semantification process, we explored the capabilities of word embed- 
ding techniques. These models generally perform well on several natural language processing 
tasks and are able to capture co-occurrences of tokens in large corpora. These co-occurrences 
seem to model semantic relationships, as is often shown with the well-known king-queen relationship².
Unfortunately, we were unable to achieve similar results for math embeddings due
to fundamental issues in existing embedding approaches. While natural language sentences are 
a sequential order of words, math formulae are deeply nested structures in which only a few 
tokens are fixed. However, distinguishing fixed from variable tokens, i.e., identifying the stem 
of a mathematical function, is context-dependent. In order to overcome these representational 
issues of mathematical expressions, we introduced a new nested concept for mathematical 
expressions, MOI. 
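
The king-queen relationship mentioned above can be made concrete with a toy calculation. The two-dimensional vectors below are hand-made for illustration and are not trained embeddings; the sketch only shows how the similarity of two offset vectors is measured.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def offset(u, v):
    """Component-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


# Hand-made toy vectors with the axes (royalty, gender); purely illustrative.
vectors = {
    "king": [0.9, 0.8],
    "man": [0.1, 0.8],
    "queen": [0.9, -0.8],
    "woman": [0.1, -0.8],
}

similarity = cosine_similarity(
    offset(vectors["king"], vectors["man"]),
    offset(vectors["queen"], vectors["woman"]),
)
```

Here, both offsets point along the royalty axis, so their cosine similarity is close to one. For math tokens, no comparably stable offsets emerged in our experiments, as discussed above.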


1 Kindergarten to early college.
2 The relationship between king and man is very similar (in terms of cosine distance between the vector
representations) to the relationship between queen and woman.

Figure 6.1: Layers of a mathematical expression with mathematical objects (MOI). MOI in the
function layer can be semantically enhanced by semantic LaTeX macros. The red tokens are
fixed tokens of the MOI and the gray tokens are variable (variables and parameters).


A Mathematical Object of Interest (MOI) represents a meaningful mathematical subexpression
(math object), which might be composed of other MOI. Figure 6.1 shows different layers of
mathematical objects within the defining formula of Jacobi polynomials. As previously mentioned,
most MathIR approaches focus on the context-independent elements in the expression
or identifier layer. For translating equations from LaTeX to CAS syntaxes, however, the elements
in the layers between both extremes are generally most crucial. If we want to translate an
equation to the syntax of CAS, we need to primarily translate MOI in the function layer because
those elements are mapped to unique keywords in the CAS. As an approach to explore the
usability of the new MOI concept, we performed the first large-scale notation study of over
2.5 billion mathematical subexpressions in 2 million documents from arXiv and zbMATH. We
have shown that the distribution of mathematical subexpressions is similar to words in natural 
language corpora. Following the idea that mathematical expressions are more comparable to 
sentences in natural languages, we analyzed the effectiveness of distribution scores, such as 
BM25, to retrieve MOI for given textual descriptions and achieved good results. 


Consequently, we developed a novel semantification pipeline based on the MOI concept in
which we presume that every isolated mathematical expression in a text is
meaningful. The connections between MOI are modeled by a mathematical dependency graph
that links two MOI if one is a subexpression of the other (following a specific heuristic to allow
matches between Γ(x) and Γ(z)). Each MOI (now a node in the dependency graph) is tagged
with descriptions extracted from the textual context. With these descriptions, we can retrieve
semantic LaTeX macros that represent the MOI. In addition, the dependency graph allows
retrieving semantic LaTeX macros for each meaningful subexpression too. Finally, we semantically
enhance the original LaTeX expression by replacing each MOI with the corresponding semantic
LaTeX macro. The resulting enhanced expression can be further translated to CAS syntaxes with
LaCASt. Figure 6.2 shows the relevant annotations and dependencies of the defining formula of
Jacobi polynomials in the English Wikipedia article. In order to replace LaTeX with semantic
LaTeX macros, we retrieve all textual descriptions (green boxes) surrounding the formula and all
dependent MOI (blue boxes).


Figure 6.2: The annotated defining formula of Jacobi polynomials (yellow) in the English
Wikipedia article. The defining formula depends on two other MOI (blue) in the same article:
P_n^{(\alpha,\beta)}(x) and (\alpha + 1)_n. Hence, in order to properly translate the defining formula, we need to
translate the dependent MOI. This can be achieved by retrieving textual annotations (green)
from the surrounding context.
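
The role of the dependency graph and of the retrieval depth can be sketched as follows. This is a strongly simplified illustration: plain substring containment stands in for the subexpression heuristic, and the function names, the shortened defining formula, and the annotations are assumptions made for the example.

```python
from collections import defaultdict


def build_dependency_graph(mois):
    """Map each MOI to the MOI it depends on, i.e., other MOI that occur inside it.

    Plain substring containment stands in for the subexpression heuristic.
    """
    graph = defaultdict(set)
    for outer in mois:
        for inner in mois:
            if inner != outer and inner in outer:
                graph[outer].add(inner)
    return graph


def collect_descriptions(moi, graph, annotations, depth):
    """Gather the descriptions of an MOI and of its dependencies up to the given depth."""
    found = list(annotations.get(moi, []))
    if depth > 0:
        for dependency in sorted(graph.get(moi, ())):
            for text in collect_descriptions(dependency, graph, annotations, depth - 1):
                if text not in found:
                    found.append(text)
    return found


# Shortened, hypothetical version of the situation in Figure 6.2.
jacobi = r"P_n^{(\alpha,\beta)}(x)"
pochhammer = r"(\alpha+1)_n"
definition = r"P_n^{(\alpha,\beta)}(x) = \frac{(\alpha+1)_n}{n!} F(x)"
annotations = {
    jacobi: ["Jacobi polynomial"],
    pochhammer: ["Pochhammer's symbol"],
    definition: ["hypergeometric function"],
}

graph = build_dependency_graph([definition, jacobi, pochhammer])
depth_zero = collect_descriptions(definition, graph, annotations, 0)
depth_one = collect_descriptions(definition, graph, annotations, 1)
```

With depth zero, only the description attached to the defining formula itself is returned. With depth one, the descriptions of the two dependent MOI are added, which is the information needed to select the corresponding semantic macros.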


The proposed semantification approach requires a semantic LaTeX macro to semantically enhance
an MOI. The semantic macros were developed for the DLMF and mostly cover OPSF. General-purpose
CAS, like Maple and Mathematica, natively support functions from this area in general.
Hence, there is a significant overlap between the functions that have a semantic macro in the
DLMF and those that are natively supported by CAS. Translating general expressions to CAS is often
not possible and may require entirely new subroutines in the CAS. For example, the prime counting
function π(x) does not exist in Maple. In this case, translating π(x) to Maple is impossible unless
we are able to automatically generate subroutines that are able to compute this function. Often,
however, general functions are much simpler and may be represented by known functions, e.g.,
f(x) := \sin^2(x). In this case, we need to identify the definition of f(x) in order to properly
translate it. Translating f(x) − g(x), for instance, is meaningless without knowing the definitions
of f(x) and g(x). However, determining whether an equation declares a definition remains an
open research task for future work.


As an alternative to the new context-sensitive translation pipeline for LaCASt, we also
experimented with machine translation approaches for LaTeX to CAS conversions. We discovered
that our machine translation approach is very powerful in adapting conversion rules of other
converters, e.g., the LaTeX export function of Mathematica or the conversion process by LaTeXML.
Here, we achieved up to 95.1% exact match accuracy for undoing an export conversion by
Mathematica and 90.7% accuracy for undoing a conversion by LaTeXML. However, we also identified
that such machine translations are very unreliable when it comes to general mathematical
expressions. On 100 randomly selected samples from the DLMF, our machine translation approach
correctly translated only 5% of the expressions, compared to 11% by Mathematica and 7%
by SymPy. Our rule-based translator LaCASt achieved 22%. If LaCASt performs translations on the
original semantic LaTeX source of the 100 samples from the DLMF, it achieves 51% accuracy.
On non-semantically enhanced cases from Wikipedia articles, our new context-sensitive version of
LaCASt correctly translated 27% compared to the state-of-the-art 9% by Mathematica. We have
also shown that a proper definition detection system and an improved common knowledge
dataset would boost the number of correctly translated expressions to 47%. In comparison, a
human annotator was able to translate 81% of the expressions manually.


To determine whether a translation was correct, one cannot directly adapt established measures
for natural language translations. The well-known BLEU score, for instance, is inappropriate
since two entirely different mathematical expressions can still be equivalent. Hence, we developed
a novel evaluation system based on the fact that a translated expression can be further computed
by CAS. Consider an equation that mathematicians have manually proven, such as

\[ \sin^2(z) + \cos^2(z) = 1. \tag{6.1} \]

If the translation of this expression was correct, the equation must be valid in the syntax of the
CAS too. Most CAS are powerful enough to verify such simple equivalences, e.g., via symbolic
simplifications. In combination with a comprehensive library of proven equations, such as the
DLMF, we could semantically evaluate translations by LaCASt.


There is a catch to this evaluation technique. Verifying an equation to be correct can become
arbitrarily complex (consider the infamous Riemann hypothesis or Fermat's Last Theorem, for
example). Hence, automatically verifying an equation with CAS is limited. Nonetheless, CAS
are powerful and flexible tools, especially when it comes to numeric evaluations. We developed
a two-step evaluation approach to verify an equation in CAS. First, we try to symbolically
simplify the difference of the left- and right-hand sides of an equation to zero. If the result
is zero, the equation is considered symbolically verified. Second, if the symbolic verification
failed, we numerically calculate the difference between the left- and right-hand sides for
actual numeric test values. An equation is numerically verified if the difference is close to
zero for all test values (an exact zero cannot be expected due to machine accuracy). While the
numeric evaluation approach never proves equivalence, it can detect disparity. A symbolically or
numerically verified equation can be considered as correctly translated by LaCASt.
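
The numeric step described above can be illustrated with a small Python sketch. It is not LaCASt code: the function name, the test values, and the tolerance are assumptions chosen for illustration, and the symbolic step is omitted because it requires a CAS.

```python
import cmath

def numerically_verified(lhs, rhs, test_values, tol=1e-9):
    """Return True if |lhs(z) - rhs(z)| <= tol for every test value."""
    return all(abs(lhs(z) - rhs(z)) <= tol for z in test_values)

# Hypothetical test values (real and complex); not the values used in the thesis.
TEST_VALUES = [0.5, -1.5, 2.0, 1 + 1j, -0.5 + 2j]

# Equation (6.1): sin^2(z) + cos^2(z) = 1
correct = numerically_verified(
    lambda z: cmath.sin(z) ** 2 + cmath.cos(z) ** 2,
    lambda z: 1,
    TEST_VALUES,
)

# A deliberately wrong variant with a sign error on the left-hand side
wrong = numerically_verified(
    lambda z: cmath.sin(z) ** 2 - cmath.cos(z) ** 2,
    lambda z: 1,
    TEST_VALUES,
)
```

A passing check only says that no disparity was found on the chosen test values; a failing check, as for the second equation, shows that the two sides differ.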


It turns out that the translations of LaCASt are so reliable on DLMF equations that this evaluation
technique not only detects issues in the translation process but in the source and target systems
as well. Consider an error in a test equation, such as in


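\[ \mathsf{Q}^{-1/2}_{\nu}(\cos\theta) = {\color{red}-}\left(\frac{\pi}{2\sin\theta}\right)^{1/2} \frac{\cos\left(\left(\nu+\frac{1}{2}\right)\theta\right)}{\nu+\frac{1}{2}}. \tag{6.2} \]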


The numeric evaluation would fail for most test values, indicating that there was an error either
in the source equation, i.e., the DLMF, in the translator LaCASt, or in the target CAS. Hence, we
evaluated the entire DLMF with this evaluation technique and identified numerous issues
in the DLMF, Wikipedia, Maple, and Mathematica. Via LaCASt translations and evaluations, for
example, we identified the sign error (the red marked minus) in equation (6.2) in the DLMF [98,
(14.5.14)]. This error was fixed with version 1.0.16 of the DLMF. The most notable error reports
include this sign error and incorrect semantic annotations in the DLMF, wrong calculations
for specific integrals and bugs in a variable extraction algorithm in Mathematica, incorrect
symbolic computations in Maple, and malicious edits in Wikipedia articles³.


³ An overview of discovered, reported, and fixed issues in CAS, the DLMF, and the Wikipedia articles is available
in Appendix D in the electronic supplementary material.

Note that, even with our novel semantification approach, LaCASt cannot be considered a
finished project (see Section 6.3). Several improvements could be achieved in the future. A
crucial issue occurs, for instance, if a function does not follow the DLMF standard notation, e.g.,
$p(n, \alpha, \beta, x)$ for the Jacobi polynomial rather than $P_n^{(\alpha,\beta)}(x)$. In that
case, LaCASt is incapable of translating the expression. There is, however, no easy solution to
this problem. Such a custom notation raises the question about the order of the arguments. For
example, in $p(a, b, c, d)$, we cannot determine if $c$ refers to the degree of the Jacobi
polynomial and should be mapped to the first argument in Mathematica syntax or to any other
position. One possible workaround is to fetch and analyze the definition of $p(a, b, c, d)$,
provided the definition is available in the context. By comparing the definition in the context
with the actual Jacobi polynomial definition in the DLMF or the CAS, we could map each argument
to its respective semantics, e.g., $c$ to the degree of the polynomial. Such a comparison would
introduce its own challenges. For example, what if the definition is not exactly the same as in
the DLMF? Moreover, as we pointed out earlier, determining whether an equation is a defining
formula is also an open research question. Recently, a similar issue gained interest in the NLP
community with the goal of determining the semantic classification of paragraphs and text spans,
such as definitions, theorems, or examples [111, 134, 183, 209, 370]. Most of the remaining issues
of LaCASt come along with open research questions. Some examples are:


How can we distinguish an equation from a defining formula? 


How can we determine the stem of a function by a given definition? 


How can we identify constraints and their scopes in natural language contexts? 


Are there specific numeric values an equation should be tested on to increase the
trustworthiness of positive numeric evaluation results?


How can a translation process overcome different branch cut positions between domain 
and co-domain representations? 


Nonetheless, LaCASt, in its current state, already outperforms existing
presentational-to-computational translation solutions, improves the scientific work cycle of
experimenting and publishing, and even helps to correct issues in DML and CAS. LaCASt increases
the trustworthiness of translations through transparent communication about the translation
decisions [13]. In combination with direct access to the kernels of CAS, LaCASt also performs
automatic verification checks on its translations, the source formula, and the system
computations. This capability was successfully demonstrated on the DLMF, in which we were able to
identify numerous issues, from missing or incorrect semantic annotations to wrong constraints and
sign errors [2]. With the same evaluation approach, LaCASt helps discover bugs in the commercial
CAS Maple and Mathematica [8]. In Wikipedia, LaCASt computations allow for detecting malicious
edits, and the performed semantic enhancements potentially improve the readability and
accessibility of mathematical content [11].


In addition, several of the projects on the way to the final version of LaCASt contributed towards
multiple MathIR tasks. The developed MathML benchmark MathMLben, for instance, is used
for research in mathematical entity linking [321]. Our math embedding experiments enabled
new approaches, such as centroid search queries and similarity measures for mathematical
expressions [15, 323, 332, 404]. Our study about the frequency distributions of mathematical
subexpressions in large corpora [14] enabled a new search engine for zbMATH [16], an
auto-completion for mathematical inputs, new approaches for plagiarism detection systems⁴, and
literature recommendation systems that will, for the first time, take mathematical content into
account [50]. The mathematical dependency graph generated by LaCASt can be embedded in
Wikipedia to provide additional semantic information about a formula in a pop-up information
window [17]. Lastly, LaCASt is currently planned to be integrated into future versions of the DLMF
to provide static translations for all DLMF equations and a live interface for general expressions.
The source of LaCASt has been publicly available at https://github.com/gipplab/LaCASt since
February 2022.


LaCASt Translation Examples. To conclude with the examples from the introduction of the
thesis, LaCASt correctly translates every expression in Table 1.2 to Maple, Mathematica, and
SymPy. On 100 randomly selected formulae from the DLMF, LaCASt correctly translated 22% and
significantly outperforms existing converters, such as Mathematica (11%), SymPy (7%), and
machine translations (5%). For the semantic LaTeX source, LaCASt correctly translated 51% of the
100 samples. LaCASt addresses the issues of branch cuts and differences in definitions between
the systems by providing additional information and a transparent decision process. For instance,
arccot(z) is translated to Maple as arccot(z), but LaCASt warns about the differences in the
positioning of branch cuts and informs the user about alternative translation patterns, such
as I/2*ln(($0-I)/($0+I)) or arctan(1/($0)). Additionally, LaCASt provides links to the
definitions of the function, the domains, and the constraints, if available. By providing a textual
context that declares $P_n^{(\alpha,\beta)}(x)$ as the Jacobi polynomial and $\Gamma(z)$ as the
Gamma function, LaCASt also correctly translates equation (1.1) from the introduction. Neither CAS
import functions nor alternative translations via MathML (followed by an import to the CAS) are
capable of correctly translating equation (1.1), all expressions in Table 1.2, or $\pi(x + y)$ in
various contexts. Further, no system besides LaCASt informs the user about potential issues, such
as the different branch cuts of arccot(z).
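
The branch cut issue for arccot can be made concrete with a small numeric check. The following Python sketch is an illustration only (it is not LaCASt code, and the function names are hypothetical). It compares two common real-valued definitions of the inverse cotangent, one of which corresponds to the alternative pattern arctan(1/($0)) mentioned above. Which definition a particular CAS implements is not asserted here.

```python
import math

def arccot_via_reciprocal(x):
    # arctan(1/x): values in (-pi/2, pi/2), with a jump at x = 0
    return math.atan(1 / x)

def arccot_via_complement(x):
    # pi/2 - arctan(x): values in (0, pi), continuous on the real line
    return math.pi / 2 - math.atan(x)

x = -1.0
a = arccot_via_reciprocal(x)   # -pi/4
b = arccot_via_complement(x)   # 3*pi/4

# Both values are valid inverse cotangents of -1, but they differ by pi.
difference = b - a

# For positive arguments the two definitions agree.
same_for_positive = math.isclose(arccot_via_reciprocal(2.0), arccot_via_complement(2.0))
```

A translation that maps arccot to arccot by name is therefore only safe if source and target agree on the convention, which is why a warning about alternative patterns is useful.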


To provide a more sophisticated example that underlines the capabilities of LaCASt, consider
Bailey's transformation of very-well-poised ${}_8\phi_7$ from the DLMF [98, (17.9.16)]

\begin{multline}
{}_8\phi_7\left(\begin{matrix} a, qa^{1/2}, -qa^{1/2}, b, c, d, e, f \\ a^{1/2}, -a^{1/2}, aq/b, aq/c, aq/d, aq/e, aq/f \end{matrix}; q, \frac{a^2q^2}{bcdef}\right) \\
= \frac{(aq, aq/(de), aq/(df), aq/(ef); q)_\infty}{(aq/d, aq/e, aq/f, aq/(def); q)_\infty}
{}_4\phi_3\left(\begin{matrix} aq/(bc), d, e, f \\ aq/b, aq/c, def/a \end{matrix}; q, q\right) \\
+ \frac{(aq, aq/(bc), d, e, f, a^2q^2/(bdef), a^2q^2/(cdef); q)_\infty}{(aq/b, aq/c, aq/d, aq/e, aq/f, a^2q^2/(bcdef), def/(aq); q)_\infty} \\
\times {}_4\phi_3\left(\begin{matrix} aq/(de), aq/(df), aq/(ef), a^2q^2/(bcdef) \\ a^2q^2/(bdef), a^2q^2/(cdef), aq^2/(def) \end{matrix}; q, q\right).
\tag{6.3}
\end{multline}

Neither CAS nor other translation approaches are capable of interpreting and translating this
expression correctly with (or without) semantic annotations or textual descriptions. Mathematica,
for example, cannot interpret leading indices correctly, such as in ${}_8\phi_7$, and is unable to
understand $(a, b; q)_n$ because the multiple q-Pochhammer symbol does not exist in Mathematica.



⁴ See the DFG (German Research Foundation) fund: Analyzing Mathematics to Detect Disguised Academic
Plagiarism (https://gepris.dfg.de/gepris/projekt/437179652 [accessed 2021-09-08])

Since the DLMF source uses semantic macros to unambiguously describe the expression, LaCASt
translates this complicated equation from the DLMF to Mathematica effortlessly by exploiting
the definition of the multiple q-Pochhammer symbol. Additionally, LaCASt provides useful
information about the internal decision process (see Figure 6.3). Outside of the DLMF, e.g.,
in Wikipedia, LaCASt would require a context that explains the functions in equation (6.3) to
properly disambiguate the components.


A short example context that enables LaCASt to properly understand equation (6.3)


The basic hypergeometric function ${}_r\phi_s(\ldots; q, z)$ and the multiple q-Pochhammer symbol
$(a, b; q)_n$ describe Bailey's transformation of very-well-poised ${}_8\phi_7$.


In combination with this context, LaCASt identifies the function patterns and semantically
enhances the input expression with DLMF macros. Consequently, LaCASt correctly translates the
expression to Mathematica, as it did for the original DLMF source equation, and provides the
same useful information about the translation decisions (see Figure 6.3). Unfortunately, the
equation is too complex for our automatic evaluation approach.


Performing a manual translation of such complex expressions is very exhausting and requires
a deep understanding of the CAS. Simple mistakes, such as a sign error or a switched order of
arguments, can lead to errors that are very difficult to detect. Additionally, even
translations to appropriate counterparts in the CAS can quickly lead to undesired behaviour
(as we have seen for translations of arccot(-1)). By providing information about the internal
translation decisions, LaCASt translations are more trustworthy and comprehensible. LaCASt
notifies a user about potential issues regarding branch cut positions or questionable translation
decisions, mitigating the chance of wrong, untraceable errors. For instance, LaCASt is aware
that the q-multi-Pochhammer symbol is not natively supported by Mathematica and performs an
alternative translation instead. Further, LaCASt sensitizes users to potential
ambiguity issues, such as the use of abbreviations⁵ or the ambiguity⁶ of e.


Translation of Bailey's Transformation of Very-Well-Poised ${}_8\phi_7$ (see equation (6.3)
and [98, (17.9.16)])


QHypergeometricPFQ[{a,q*(a)^(Divide[1,2]),-q*(a)^(Divide[1,2]),b,c,d,e,f},
  {(a)^(Divide[1,2]),-(a)^(Divide[1,2]),a*q/b,a*q/c,a*q/d,a*q/e,a*q/f},q,
  Divide[(a)^(2)*(q)^(2),b*c*d*e*f]]

== Divide[Product[QPochhammer[Part[{a*q,a*q/(d*e),a*q/(d*f),a*q/(e*f)},i],q,
  Infinity],{i,1,Length[{a*q,a*q/(d*e),a*q/(d*f),a*q/(e*f)}]}],Product[
  QPochhammer[Part[{a*q/d,a*q/e,a*q/f,a*q/(d*e*f)},i],q,Infinity],{i,1,
  Length[{a*q/d,a*q/e,a*q/f,a*q/(d*e*f)}]}]]*QHypergeometricPFQ[{a*q/(b*c),
  d,e,f},{a*q/b,a*q/c,d*e*f/a},q,q]

+ Divide[Product[QPochhammer[Part[{a*q,a*q/(b*c),d,e,f,(a)^(2)*(q)^(2)/(b*d*
  e*f),(a)^(2)*(q)^(2)/(c*d*e*f)},i],q,Infinity],{i,1,Length[{a*q,a*q/(b*c),
  d,e,f,(a)^(2)*(q)^(2)/(b*d*e*f),(a)^(2)*(q)^(2)/(c*d*e*f)}]}],
  Product[QPochhammer[Part[{a*q/b,a*q/c,a*q/d,a*q/e,a*q/f,(a)^(2)*(q)^(2)/
  (b*c*d*e*f),d*e*f/(a*q)},i],q,Infinity],{i,1,Length[{a*q/b,a*q/c,a*q/d,
  a*q/e,a*q/f,(a)^(2)*(q)^(2)/(b*c*d*e*f),d*e*f/(a*q)}]}]]

* QHypergeometricPFQ[{a*q/(d*e),a*q/(d*f),a*q/(e*f),(a)^(2)*(q)^(2)/(b*c*d*
  e*f)},{(a)^(2)*(q)^(2)/(b*d*e*f),(a)^(2)*(q)^(2)/(c*d*e*f),a*(q)^(2)/(d*
  e*f)},q,q]


Line breaks were added manually to improve readability.


⁵ An abbreviation may refer to a single variable. For instance, def may refer to a variable definition earlier in
the article. However, an interpretation as three individual variables (i.e., d, e, and f) is often more reasonable.
⁶ The letter e is commonly used for Euler's number but can also simply refer to a Latin letter variable.

Free Variables
a, b, c, d, e, f, q


Math Constant e
You used a typical letter for a constant (the mathematical constant e, known as Napier's
constant with a value of 2.71828182845...). We keep it like it is! But you should know that
Mathematica uses E for this constant. If you want to translate it as the constant, use the
corresponding DLMF macro \expe.


Abbreviation Warning
Found a potential abbreviation: def. This program cannot translate abbreviations. Hence the
expression was interpreted as a sequence of multiplications, e.g., etc -> e*t*c.


Translation Information for ${}_r\phi_s$
Name: Basic hypergeometric (or q-hypergeometric) function
Example: \qgenhyperphi{r}{s}@@@{a_1,...,a_r}{b_1,...,b_s}{q}{z}
Translation Pattern: QHypergeometricPFQ[{$2}, {$3}, $4, $5]
Relevant Links
DLMF: http://dlmf.nist.gov/17.4#E1
Mathematica: https://reference.wolfram.com/language/ref/QHypergeometricPFQ.html


Translation Information for $(a_1, \ldots, a_n; q)_n$
Name: q-Multi-Pochhammer symbol
Example: \qmultiPochhammersym{a_1,\ldots,a_n}{q}{n}
Translation pattern unavailable. Use alternative translation pattern instead.
Alternative Translation Pattern:
Product[QPochhammer[Part[{$0},i],$1,$2],{i,1,Length[{$0}]}]
Relevant Links
DLMF: http://dlmf.nist.gov/17.2.E5
Mathematica: unavailable


Figure 6.3: Translation information about the translation of Bailey's transformation of
very-well-poised ${}_8\phi_7$ in equation (6.3) to Mathematica with LaCASt (see also the DLMF [98, (17.9.16)]).
Since the q-Multi-Pochhammer symbol is not natively supported in Mathematica, LaCASt uses
the alternative translation pattern based on the definition of the function [98, (17.2.5)]. The
information about abbreviations and names of constants is fetched from the POM tagger's
lexicon files [402] that LaCASt relies on.
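
To illustrate how a translation pattern with positional placeholders such as the one in Figure 6.3 can be instantiated, consider the following Python sketch. It is an assumption-laden illustration, not the LaCASt implementation: the helper `fill_pattern` is hypothetical, and the arguments are passed as already translated strings.

```python
import re

def fill_pattern(pattern, args):
    """Replace positional placeholders $0, $1, ... with the given argument strings."""
    def substitute(match):
        return args[int(match.group(1))]
    return re.sub(r"\$(\d+)", substitute, pattern)

# Alternative translation pattern for the q-Multi-Pochhammer symbol (see Figure 6.3).
ALT_PATTERN = "Product[QPochhammer[Part[{$0},i],$1,$2],{i,1,Length[{$0}]}]"

# Hypothetical arguments: the list a*q/b, a*q/c, the base q, and the index Infinity.
translated = fill_pattern(ALT_PATTERN, ["a*q/b, a*q/c", "q", "Infinity"])
```

The placeholder $0 occurs twice in the pattern, so the argument list is inserted both into Part and into Length, which matches the repeated lists in the Mathematica translation of equation (6.3).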
6.2 Contributions and Impact of the Thesis


This thesis made three main contributions:


1. It presented a novel semantification process that replaces MOI with semantically enhanced
LaTeX macros based on information extracted from a nearby textual context and a common
knowledge database;


2. It demonstrated the first context-sensitive LaTeX to CAS translator LaCASt, which performs
manually crafted rule-based translations to multiple CAS syntaxes from semantic LaTeX
expressions generated by the previously developed semantification process; and


3. It showcased the efficiency and usability of LaCASt with a novel evaluation approach that
symbolically and numerically verifies equations from a source database, e.g., the DLMF
or Wikipedia, with the power of CAS.


These contributions resulted in 14 peer-reviewed publications [1, 2, 3, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18] with 2 doctoral program participations [4, 5] and 2 invited talks [6, 7]. The
publications were cited 63 times⁷ overall. In addition, 3,782 commits⁸ to a variety of
open source projects were performed during the time of the thesis. In the following, we briefly
summarize the contributions of this thesis for each of the five research tasks that were defined
in the introduction, Section 1.3.


Research Task I


Analyze the strengths and weaknesses of existing semantification approaches for translating
mathematical expressions to computable formats.
Contributing Publications: [1, 9, 12, 18]


To analyze the strengths and weaknesses of existing translation tools, we performed a new
evaluation of nine state-of-the-art LaTeX to MathML converters, including Mathematica as a
CAS. We developed a new benchmark for MathML, called MathMLben, to evaluate translations
against a manually crafted gold standard dataset. All converters rely solely on the semantic
information that can be retrieved from the structure of an expression, e.g., by pattern matching
approaches. In addition, only three converters supported content MathML, and only with
unsatisfactory accuracy.


The main identified weakness of all analyzed tools was that they do not take local contextual
information into account for the translation process. Through our evaluation, we were able
to significantly improve LaTeXML translations by manually annotating LaTeX expressions with
semantic information via semantic LaTeX macros. This performance improvement underlines
the need for a semantification process that automatically performs semantic annotations based
on information from a given context. The poor accuracy of all evaluated conversion tools
showed that translations from LaTeX over MathML to CAS have no advantages compared to
other translation paths, e.g., over semantic LaTeX. Since semantic LaTeX translations to Maple
were successfully implemented with the first version of LaCASt, and the accuracy of LaTeXML
significantly improved through annotations with semantic macros, we chose semantic
LaTeX as an intermediate format to translate expressions from LaTeX to CAS syntaxes.


⁷ According to Google Scholar, evaluated on 2021-09-16.
⁸ According to github.com, evaluated on 2021-19-08.

Research Task II


Develop a semantification process that will improve on the weaknesses of current approaches.
Contributing Publications: [10, 14, 15]


We accomplished this research task by developing a novel semantification process that relies
on the textual information in the nearby context of a formula combined with a set of standard
knowledge information. As a first attempt at creating a new common knowledge dataset, we
studied math embeddings (i.e., word embeddings for mathematical expressions) to retrieve
common co-occurrences between math objects and textual descriptions. This attempt was
unsuccessful due to the flexible and nested nature of mathematical notations. Instead, we relied
on the DLMF and the lexicon files of the POM tagger for our common knowledge database.


To analyze the nearby textual context, we retrieve noun phrases as descriptions for
mathematical objects. Since the concept of mathematical objects was barely studied in the past, we
introduced a new concept of so-called Mathematical Objects of Interest (MOI). The idea behind
MOI is that every mathematical subexpression is potentially meaningful. Previous research
efforts in the MathIR area focused either on single identifiers or on entire mathematical
expressions, ignoring the interconnectivity between subexpressions in math formulae. The new MOI
concept has proven successful on a variety of different tasks in MathIR. Consequently, we
developed a novel semantification process based on MOI. The semantification process generates
a mathematical dependency graph of MOI and annotates each MOI with textual descriptions
from its textual context. The dependencies provide access to relevant descriptions of an MOI
and its subexpressions (which are also MOI). With these descriptions, we retrieve semantic
LaTeX macros from the DLMF that replace the original LaTeX subexpression. This semantification
gradually transforms the original LaTeX expression into the semantically enhanced semantic LaTeX
encoding.


Research Task III


Implement a system for the automated semantification of mathematical expressions in
scientific documents. Contributing Publications: [11, 16, 17]


We achieved this research task by relying on the results of several previous research projects. The
nearby textual analysis was performed with a modified version of the mathosphere system [279,
329, 330], which was initially designed to retrieve identifier-definiens pairs from a mathematical
text. We updated the system to retrieve facts, i.e., pairs of MOI and textual descriptions, from
a given text. We further generated the dependency graph of MOI in a document with the
approaches outlined by Kristianto et al. [214]. Finally, we extended the POM tagger [402] to
create tree patterns of semantic LaTeX macros from the DLMF.


This new semantification pipeline is performed in four steps. First, we analyze a given text,
e.g., a Wikipedia page, to identify all MOI and noun phrases. Second, we build a mathematical
dependency graph by defining directed edges between MOI if an MOI is a subexpression of
another MOI. Further, each MOI is annotated with noun phrases taken from the same sentence
the MOI appears in (including subexpression appearances). Third, we use the noun phrases of
an MOI and the noun phrases of dependent MOI to determine replacement patterns to semantic
DLMF LaTeX macros. This replaces generic LaTeX subexpressions by semantic LaTeX macros.
Fourth, the resulting semantic LaTeX expression is translated to the target CAS syntax
by LaCASt (see the next research task).
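
The second step, building the dependency graph, can be sketched in a few lines of Python. This is a deliberately naive illustration and not the implementation used in LaCASt: MOI are represented as plain LaTeX strings, "is a subexpression of" is approximated by a substring test, the edge direction (from the containing MOI to the contained MOI) is an assumption, and the example formulae are shortened stand-ins.

```python
def build_dependency_graph(moi_list):
    """Map each MOI to the other MOI that occur inside it (naive substring test)."""
    graph = {moi: [] for moi in moi_list}
    for outer in moi_list:
        for inner in moi_list:
            if inner != outer and inner in outer:
                graph[outer].append(inner)
    return graph

# Hypothetical MOI; the first one is a shortened stand-in for a defining formula.
moi = [
    r"P_n^{(\alpha,\beta)}(x) = \frac{(\alpha+1)_n}{n!} F(x)",
    r"P_n^{(\alpha,\beta)}(x)",
    r"(\alpha+1)_n",
]
graph = build_dependency_graph(moi)
```

Here the defining formula depends on the two smaller MOI, so noun phrases attached to either of them become available when the defining formula is semantically enhanced. A real system would compare expression trees rather than strings.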


For this research task, we also explored the capabilities of machine translation techniques.
We discovered that our sequence-to-sequence model outperforms other machine translation
models and achieves very good scores on undoing conversions of rule-based translators, such
as Mathematica's LaTeX export function and LaTeXML translations of semantic LaTeX. However, we
also showed that our machine translations are unreliable on other general mathematical expressions
that have not been generated by Mathematica or LaTeXML. We conclude that our machine
translation model in its current form is, therefore, unsuitable for performing LaTeX to CAS
translations.


Research Task IV


Implement an extension of the system to provide translations to computer algebra systems.
Contributing Publications: [3, 11, 13]


We accomplished research task IV with the previously developed translator LaCASt. LaCASt
was originally implemented as a rule-based translator for semantic LaTeX expressions in the
DLMF and solely supported Maple as a target CAS. In this thesis, we extended LaCASt to support
more CAS, especially focusing our efforts on Mathematica and (more recently) on SymPy.
Further, we implemented additional semantification heuristics in order to correctly translate
the mathematical operators for integrals, sums, products, and limits. With a study of the prime
notations (for derivatives) in the DLMF, we further expanded the coverage of LaCASt translations,
specifically for functions in the DLMF.


Lastly, we added the previously developed semantification pipeline to LaCASt, which finally
turned LaCASt into the first context-sensitive LaTeX to CAS translator. LaCASt is currently set up
to parse the context of a given English Wikipedia article. However, the pipeline itself can
analyze any English text document that encodes mathematical formulae in LaTeX.


Research Task V


Evaluate the effectiveness of the developed semantification and translation system.
Contributing Publications: [2, 8, 11]


We accomplished research task V with a combination of a qualitative and quantitative
evaluation pipeline. For the qualitative evaluation of LaCASt, we manually crafted a benchmark
dataset of 95 equations from English Wikipedia articles about OPSF. LaCASt was able to correctly
transform LaTeX into semantic LaTeX for 48% of the equations and achieved 27% correct
translations to Mathematica overall. In comparison, Mathematica's LaTeX import function correctly
imported 9% of the expressions and a human annotator was able to translate 81% of the
equations to Mathematica. We were able to show that a theoretical concept of definition detection
and a domain-dependent common knowledge database (rather than a fixed common knowledge
database) would increase the number of correct translations via LaCASt to Mathematica from 27%
to 49%. Performing translations from the semantic LaTeX dataset DLMF underlines that the most
pressing issue still remains a reliable semantification pipeline. LaCASt was able to translate
62.9% and 72% of all DLMF equations to Maple and Mathematica, respectively. To evaluate
the semantification, we further analyzed LaCASt's ability to retrieve relevant descriptions from
the context of a given formula and achieved an F1 score of .495 (.508 precision and .483 recall,
respectively).
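
The reported F1 score is consistent with the stated precision and recall, since F1 is their harmonic mean. The short Python check below only restates this arithmetic; the function name is chosen for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values reported above: .508 precision and .483 recall
f1 = f1_score(0.508, 0.483)
print(round(f1, 3))  # 0.495
```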


Further, we developed a new concept to verify a translated expression based on the assumption
that a correct equation in the source database must remain valid after translating it to the
target system. The computational ability of CAS allows us to perform verification checks on
translated equations and enables us to evaluate large datasets. In particular, we performed two
novel approaches, symbolic and numeric evaluations. The symbolic evaluation tries to simplify the
difference between the left- and right-hand sides of an equation to zero. The numeric evaluation
performs actual numeric calculations on test values and numerically checks the equivalence of
an equation's left- and right-hand sides. On the DLMF, LaCASt was able to symbolically verify
26.3% and 26.2% of the translations to Maple and Mathematica, respectively. Symbolically unverified
expressions were further evaluated numerically. LaCASt achieved a numeric verification rate
of 26.7% for Maple and 22.6% for Mathematica. In combination, both evaluation techniques
verified 43.3% of the translations for Maple and 42.9% of the translations for Mathematica.
Performing the same techniques on the Wikipedia articles resulted in an overall verification rate
of 18.1% and 23.6% for Maple and Mathematica, respectively.


The novel verification approach has proven to be very successful and even identified issues in
the source databases, i.e., Wikipedia articles and the DLMF, and bugs in the commercial target
CAS Maple and Mathematica. With the automatic evaluations from LaCASt, we identified bugs
regarding integrals and the variable extraction function in Mathematica, discovered numerous
minor issues in the DLMF, including a sign error and incorrect semantic annotations, and
detected a malicious edit in the Wikipedia edit history in the domain of OPSF. The errors in
Mathematica and the DLMF have been reported and mostly fixed⁹. An overview of the reports
is available in Appendix D in the electronic supplementary material.


6.3 Future Work 


The research advances in MathIR and the development of LaCASt in this thesis motivate several
follow-up projects. Current plans include incorporating LaCASt into the DLMF to provide
translations, automatic evaluation results, and peculiarities compared to multiple CAS for
each equation. Additionally, there are plans to offer LaCASt as a translation-as-a-service
endpoint. The developed semantification process is also planned to find its way into MediaWiki
to semantically enhance mathematical content in Wikipedia pages. LaCASt was not open
source while the research on this thesis took place due to its dependency on the POM tagger
[402] and the semantic LaTeX macros [260]. Since February 2022, the source code has been publicly
available at https://github.com/gipplab/LaCASt.


In this section, we provide a brief overview of four specific projects for our future work.
Section 6.3.1 discusses ideas to address the shortcomings of LaCASt and related open research
questions that motivate follow-up projects. Section 6.3.2 discusses how we plan to improve
existing LaTeX to MathML converters with our semantification pipeline. Section 6.3.3 explains
the Wikipedia extension for semantically enhanced mathematical expressions. This section was
published as a poster together with M. Schubotz [17]. In Section 6.3.4, we discuss a potential
multilingual support of LaCASt. The multilingual research project will be part of a DAAD-funded
post-doctoral scholarship.


⁹ As of 2021-10-01.


6.3.1 Improved Translation Pipeline 


The performance of the presented context-sensitive translator LaCASt leaves some room for
improvement and even motivates entirely new research projects. The most pressing shortcoming
of LaCASt is the lack of generalizability beyond OPSF. The main reason for this shortcoming is
the open research task of identifying equations as definitions. Recent advances in definition
detection in natural languages [111, 134, 183, 370] may pave the way to a reliable classification
of mathematical equations in the near future. An equation tagged as a definition enables correct
translations of dependent formulae in the same document. This enables LaCASt to translate
general functions, such as f(x), which are not directly defined in the CAS. Further, a definition
detection of equations may help to build a comprehensive definition library across entire
scientific corpora with numerous use cases for the mathematical community.


Another issue that remains neglected by our translation tool is the positioning of
branch cuts for multi-valued functions. The main reason for that shortcoming is that there is
no database or standard available to store and describe branch cuts uniformly across multiple
systems and libraries. While branch cuts are openly discussed and presented, their description is
often embedded in natural language text, which harms the machine readability and,
consequently, the accessibility of the information. In order to consider branch cut positions
for a more reliable translation, we need to develop a standard to describe positions uniformly in
a machine-readable format. Subsequently, a manual analysis across multiple CAS and libraries,
including the DLMF, is required to build a comprehensive database that stores this information.
Translation tools may finally use the database to either provide additional information during
a translation process or automatically perform alternative translations based on the stored
positioning of branch cuts. The latter, while considerably more difficult, is beneficial to improve
the verification of equations in the DLMF further.
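

To make the idea of a uniform, machine-readable description more concrete, the following
Python sketch shows one possible shape of such a record. It is only an illustration under our
own assumptions: the class names, the restriction to straight segments, the system labels
system_a and system_b, and the two example entries are hypothetical and do not describe an
existing standard or the actual behavior of a particular CAS.

```python
import math
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Segment:
    """A straight branch cut from `start` to `end` in the complex plane."""
    start: complex
    end: complex


@dataclass(frozen=True)
class BranchCutRecord:
    """Branch cuts of one function in one system (labels are hypothetical)."""
    function: str
    system: str
    cuts: FrozenSet[Segment]


I = complex(0, 1)
INF_I = complex(0, math.inf)  # stands for the point at infinity on the imaginary axis

# Illustrative entries only: two systems that place the cuts of arccot differently.
RECORDS = [
    BranchCutRecord("arccot", "system_a", frozenset({Segment(-I, I)})),
    BranchCutRecord("arccot", "system_b",
                    frozenset({Segment(I, INF_I), Segment(-INF_I, -I)})),
]


def group_by_cuts(records: List[BranchCutRecord],
                  function: str) -> Dict[FrozenSet[Segment], List[str]]:
    """Group system labels by the set of cuts they use for `function`."""
    groups: Dict[FrozenSet[Segment], List[str]] = {}
    for record in records:
        if record.function == function:
            groups.setdefault(record.cuts, []).append(record.system)
    return groups


def has_disparity(records: List[BranchCutRecord], function: str) -> bool:
    """True if at least two systems store different cuts for `function`."""
    return len(group_by_cuts(records, function)) > 1
```

With such records, a translator could call has_disparity(RECORDS, "arccot") before translating
an expression and either warn the user or select an alternative translation.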


Lastly, the powerful numeric evaluation approach used to verify a translated expression heavily
relies on the chosen numeric test values. LaCASt currently uses the same ten numeric test values
for all tested equations and filters invalid combinations regarding the constraints. While easy
to maintain for many test cases, this approach ignores function-specific attributes such as
domains, branch cuts, singularities, and other essential characteristics. Testing functions on
specific values of interest enables several valuable applications. For example, numeric calculations
specifically along the defined branch cuts of the involved functions could help to automatically
detect definition disparities on branch cuts between the systems, e.g., by evaluating arccot(-1). In
addition, testing values of interest potentially increases the trustworthiness of a numerically
verified equation significantly. However, no study about values of interest for functions has
been undertaken to the best of our knowledge. It might even be questioned if such values exist
for all functions in the DLMF. Further, the value of interest may change depending on the actual
argument of the functions. In this case, LaCASt would need to automatically adjust the tested
values accordingly, which increases the complexity of the task even further.
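

The arccot(-1) example can be illustrated with a small Python sketch. It compares two common
real-valued conventions for arccot on a list of test values and reports where they disagree.
The function names, the tolerance, and the list of test values are our own assumptions for
illustration; they are not the ten test values or the evaluation code used by LaCASt.

```python
import math
from typing import Iterable, List


def arccot_reciprocal(x: float) -> float:
    """Convention A: arccot(x) = arctan(1/x), discontinuous at x = 0."""
    return math.atan(1.0 / x)


def arccot_complement(x: float) -> float:
    """Convention B: arccot(x) = pi/2 - arctan(x), continuous on the real line."""
    return math.pi / 2 - math.atan(x)


def disparities(test_values: Iterable[float], tol: float = 1e-9) -> List[float]:
    """Return the test values on which both conventions differ by more than `tol`."""
    differing = []
    for x in test_values:
        if x == 0.0:
            # Constraint filter: convention A is undefined at x = 0.
            continue
        if abs(arccot_reciprocal(x) - arccot_complement(x)) > tol:
            differing.append(x)
    return differing


# Hypothetical test values that deliberately include negative arguments.
TEST_VALUES = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
```

Both conventions agree for positive arguments and differ by pi for negative ones, so a test set
without negative values would never reveal the disparity.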
Figure 6.4: Proposed pipeline to improve existing LaTeX to MathML converters.


6.3.2 Improve LaTeX to MathML Converters 


As described in Section 3.3, our outlined translation pipeline can also be used to improve
existing LaTeX to MathML translators. Figure 6.4 highlights this additional pipeline. In this
thesis, we primarily focused on the main pipeline along the four numbered steps in the figure.
However, the information we gathered in two of these steps can also be forwarded
to a MathML converter. In Chapter 2, we developed MathMLben, the MathML benchmark, with
the help of LaTeXML, a LaTeX to XML converter. We manually added semantic annotations to the
source expressions in order to improve the conversion by LaTeXML. For example, the first entry
contains the expression about Van der Waerden numbers W(2, k). Here, we manually added
the link to the corresponding Wikidata ID Q7913892 for W, which (together with additional
scripts) enabled LaTeXML to generate a proper, annotated content MathML representation of the
expression.


We can now use our semantification steps to automate the annotation process. In combination
with existing Wikidata entity linking approaches [320, 321, 327], we can also annotate the
original expressions with Wikidata IDs as we did manually for MathMLben. While this semantic
enrichment process through Wikidata IDs was developed specifically for LaTeXML, other LaTeX to
MathML converters can also profit from such annotations. SnuggleTeX, for example, is a LaTeX
to XML converter that allows users to pre-define the semantics of symbols in order to improve
the so-called upconversion¹⁰ process. One option in particular is the assumeSymbol command.
Besides annotating single symbols, e.g., via

    \assumeSymbol{e}{exponentialNumber} $e$,    (6.4)

we can also define generic functions, such as

    \assumeSymbol{f_{n_k}}{function} $f_{n_k}(x)$.    (6.5)


¹⁰ SnuggleTeX uses this term for referring to a conversion process that requires semantic enrichment steps, e.g.,
from LaTeX to content MathML or Maxima syntax.


These pre-defined assumptions enable SnuggleTeX to perform a correct conversion to content 
MathML or the CAS Maxima. 


6.3.3 Enhanced Formulae in Wikipedia 


Recently¹¹, we deployed a feature that enables enhancing mathematical formulae in Wikipedia
with semantics from Wikidata [308]. For instance, the wikitext code


Annotated Wikitext Formula


<math qid="Q35875">E=mc^2</math>


now connects the formula E = mc² to the corresponding Wikidata item by creating a hyperlink
from the formula to the special page shown in Figure 6.5¹². The special page displays the
formula together with its name, description, and type, which the page fetches from Wikidata.
This information is available for most formulae in all languages. Moreover, the page displays
elements of the formula modeled as has part annotations of the Wikidata item.
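

The link between an annotated formula and its special page can be sketched in a few lines of
Python. The sketch is not the implementation of the MediaWiki Math extension: the regular
expression, the function names, and the assumption that qid is the only attribute of the math
tag are ours. The URL pattern follows the special page address cited for Figure 6.5.

```python
import re
from typing import List, Tuple

# Simplifying assumption: `qid` is the only attribute of the <math> tag.
MATH_TAG = re.compile(r'<math\s+qid="(Q\d+)"\s*>(.*?)</math>', re.DOTALL)


def annotated_formulae(wikitext: str) -> List[Tuple[str, str]]:
    """Return (Wikidata ID, LaTeX source) pairs for all annotated formulae."""
    return [(m.group(1), m.group(2)) for m in MATH_TAG.finditer(wikitext)]


def special_page_url(qid: str, lang: str = "en") -> str:
    """Build the address of the formula information page for a Wikidata ID."""
    return f"https://{lang}.wikipedia.org/wiki/Special:MathWikibase?qid={qid}"
```

For the wikitext shown above, annotated_formulae yields the pair ("Q35875", "E=mc^2"), and
special_page_url("Q35875") yields the address of the page shown in Figure 6.5.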


The has part annotation is not limited to individual identifiers but also applicable to complex
terms, such as $\frac{1}{2} m_0 v^2$, i.e., the kinetic energy approximation for slow velocities¹³.
For example, we demonstrated using the annotation for the Grothendieck-Riemann-Roch theorem¹⁴


    $\mathrm{ch}(f_{!} F^{\bullet}) \, \mathrm{td}(Y) = f_{*}(\mathrm{ch}(F^{\bullet}) \, \mathrm{td}(X))$.    (6.6)


The smooth quasi-projective schemes X and Y in the theorem lack Wikipedia articles. However, 
dedicated articles on quasi-projective variety and smooth scheme exist. We proposed modeling 
this situation by creating the new Wikidata item smooth quasi-projective scheme¹⁵, which links to
the existing articles as subclasses. To create a clickable link from the Wikidata item to Wikipedia, 
we could create a new Wikipedia article on smooth quasi-projective scheme. Alternatively, we 
could add a new section on smooth quasi-projective scheme to the article on quasi-projective 
variety and create a redirect from the Wikidata item to the new section. 


Aside from implementing the new feature, defining a decision-making process for the
integration of math rendering features into Wikipedia was equally important. For this purpose, we
founded the Wikimedia Community Group Math¹⁶ as an international steering committee with
authority to decide on future features of the math rendering component of Wikipedia.


¹¹ A. Greiner-Petter: Link Wikipedia Articles from Specialpage Math Formula Information, GitHub Commit to
mediawiki-extensions-math on 27th November 2020:
https://github.com/wikimedia/mediawiki-extensions-Math/commit/912866b976fbdcd94fda3062244d23a34c5e7a76

¹² https://en.wikipedia.org/wiki/Special:MathWikibase?qid=Q35875 [accessed 2021-08-18]

¹³ https://en.wikipedia.org/w/index.php?oldid=939835125#Mass–velocity_relationship [accessed 2021-08-18]

¹⁴ https://en.wikipedia.org/w/index.php?title=Special:MathWikibase&qid=Q1899432 [accessed 2021-08-18]

¹⁵ https://www.wikidata.org/wiki/Q85397895 [accessed 2021-08-18]


The new feature helps Wikipedia users to better understand the meaning of mathematical
formulae by providing details on the elements of formulae. Because the new feature is available
in all language editions of Wikipedia, all users benefit from the improvement. Rolling out the
feature for all languages was important to us because using Wikipedia for more in-depth
investigations is significantly more prevalent in languages other than English [226].
Nevertheless, also in the English Wikipedia, fewer than one percent of the articles have a
quality rating of good or higher [299]. Providing better tool support to editors can help
raise the quality of articles. In that regard, our semantic enhancements of mathematical
formulae will flank other semi-automated methods, such as recommending sections [299] and
related articles [337].


Math Formula Information
Formula: E = mc²
Name: mass-energy equivalence
Type: physical law
Description: mass and energy are proportionate measures of the same underlying property of an object
Elements of the Formula: energy (E), mass (m), speed of light (c)

Figure 6.5: Semantic enhancement of the formula E = mc².


To stimulate the wide-spread adoption of semantic annotations for mathematical formulae, we are
currently working on tools that support editors in creating the annotations and, therefore,
successively determining the ground truth of mathematics in Wikipedia. With AnnoMathTeX [319],
we are developing a tool that facilitates annotating mathematical formulae by providing a
graphical user interface that includes machine learning assisted suggestions [14] for
annotations. Moreover, we will integrate a field into the visual wikitext editor that suggests
that Wikipedia authors link the Wikidata ID of a formula if the formula is in the Wikidata
database. Improved tool support will particularly enable smaller language editions of Wikipedia
to benefit from the new feature because the annotations performed in any language will be
available in all languages automatically.


Additionally, our recent advances with LaCASt on the Wikipedia dataset allow us to
automatically verify equations in Wikipedia to some degree. We are currently working on a system
that automatically triggers the verification engine on edits in mathematical content. This would
allow us to generate a live feed of verified and not verified mathematical edits in the entire
Wikipedia. While this presumably generates a lot of interesting data for numerous projects,
it will also serve as a proof-of-concept to integrate the system into existing quality control
mechanisms. In the long run, we hope to integrate the verification technique into the existing
Objective Revision Evaluation Service (ORES) [144], similar to other recently emerged ORES
extensions [359, 401].


¹⁶ https://meta.wikimedia.org/wiki/Wikimedia_Community_User_Group_Math [accessed 2021-08-18]


6.3.4 Language Independence


The multilingual aspect of our translator becomes more and more important with the focus on
Wikipedia. Since Wikipedia is a multilingual encyclopedia, a language-independent
semantification process is desirable. In general, the concept of our developed semantification
approach is language independent. The pipeline relies on a POS tagger to tag tokens
and generate parse trees of the sentences. The score of an MOI-description pair is calculated
based on the distance between both tokens in the parse tree. Consequently, we can presume
that our semantification pipeline works for other languages too, as long as a reliable
POS tagger is available for that language. However, we already noticed minor issues with the
well-developed CoreNLP POS tagger for the English language when using the MLP approach.
As a reminder, the MLP approach suggests masking mathematical elements with placeholders
before using a POS tagger on the sentence. For example, in the following sentence

Example sentence including math


The Jacobi polynomial $P_n^{(\alpha,\beta)}(x)$ is an orthogonal polynomial.


the mathematical expression is replaced by a placeholder MATH_1.


Example sentence with masked math


The Jacobi polynomial MATH_1 is an orthogonal polynomial.


While this approach works well in many cases, in this particular example, CoreNLP’s POS
tagger¹⁷ tags both polynomial tokens as adjectives (JJ) while both should be tagged as nouns
(NN).
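

The masking step itself is simple and can be sketched as follows. This is a minimal
illustration and not the MLP implementation: we assume that inline math is delimited by
dollar signs, and the names INLINE_MATH and mask_math are hypothetical. The subsequent POS
tagging is not part of the sketch.

```python
import re
from typing import Dict, Tuple

# Assumption: inline math is written between single dollar signs.
INLINE_MATH = re.compile(r"\$(.+?)\$")


def mask_math(sentence: str) -> Tuple[str, Dict[str, str]]:
    """Replace each inline formula by MATH_1, MATH_2, ... and remember the originals."""
    mapping: Dict[str, str] = {}

    def replace(match: "re.Match[str]") -> str:
        placeholder = f"MATH_{len(mapping) + 1}"
        mapping[placeholder] = match.group(1)
        return placeholder

    masked = INLINE_MATH.sub(replace, sentence)
    return masked, mapping


sentence = r"The Jacobi polynomial $P_n^{(\alpha,\beta)}(x)$ is an orthogonal polynomial."
masked, mapping = mask_math(sentence)
```

The masked sentence is then passed to the POS tagger, and the mapping allows restoring the
formulae afterwards. Because the placeholder carries no grammatical information, the tagger
has to guess its role, which is where the tagging errors described above originate.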


The underlying issue is that the MLP approach presumes math expressions to represent noun
tokens. However, the mathematical language is generally more complex than that
simple scheme [138]. This language can become quite different from general natural language
communication. The mathematical language introduces a technical terminology with entirely
new terms, such as ‘functor’, changes the meaning of existing vocabulary, such as ‘group’
or ‘ring’, and even defines entire phrases to represent math concepts, such as ‘without loss of
generality’ or ‘almost surely’. All these specifics need to be handled by a POS tagger. Math
notation is often part of a natural language sentence but does not necessarily represent a
logical token. In addition, we presume that mathematical expressions are generally
language-independent. However, the notation style may change from language to language, even for
simple cases. For example, while the US or Germany uses ≥ to express a greater or equal
relation, the notation ≧ is more common in Japan. Considering the sheer amount of different
math notations, it might not be obvious to a student from Japan that ≥ and ≧ refer to the same
relation. Yet, these symbols are so basic that most authors, even in educational literature, would
probably not explicitly declare their meaning in the context. This issue grows with a more and
more educated audience. For example, math educational books written for math students in
universities rarely mention the specific meanings of logic symbols (e.g., ∧, ∨), quantifiers
(e.g., ∀, ∃), or set notations (e.g., ∩ and ∪).


¹⁷ Tested with CoreNLP version 4.2.2.


Unfortunately, the multilingual aspects of mathematics have barely been studied in the past.
D. Halbach [143] recently tried to take advantage of the multilingual versions of Wikipedia
articles to identify the defining formula of an article. A defining formula of an article is the
mathematical expression that is the main subject of that article. For example,
$P_n^{(\alpha,\beta)}(x)$ can be considered the defining formula of the article about Jacobi
polynomials. D. Halbach assumed that a mathematical expression that appears in multiple language
versions of the same article is a good candidate for such a defining formula. Unfortunately, it
turned out that different languages tend to use different visualizations of the same formula. For
example, he showed that the Polish, English, German, and French Wikipedia articles on Schwarz’s
theorem use different mathematical formulae for the same concept. This result indicates that the
semantification approach we developed in this thesis may not be easily generalized to other
languages. In addition, there is no POS tagger available that is specialized in mathematical content.


In collaboration with researchers from the National Institute of Standards and Technology (NIST)
in the US, the National Institute of Informatics (NII) in Japan, and the University of Wuppertal
in Germany, we plan to study the multilingual aspects of mathematical languages to analyze
language-specific notation and declaration differences. This project is part of a post-doctoral
DAAD scholarship and includes training a math-specific NLP model for better POS tagging of
articles with mathematical content.


This Chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/).


Glossary


$X = \{ f \mid f \in D \cup K \land (f \in K \Rightarrow f \notin D) \}$
Our definition of a mathematical context X defined in (4.2) on page 104. A context is a
set of facts f in a document D and a set of common knowledge facts K so that document
facts overwrite common knowledge facts. 106, 107, 138


L_C — Mathematical Content Languages
Denotes mathematical content languages (CL), such as semantic LaTeX, content MathML,
or CAS syntaxes. 106, 107, 112, 134, 137, 138


L_M — Computer Algebra System Languages
Refers to CAS languages in general, such as the syntax of Mathematica, Maple, or SymPy 
inputs. 107, 109, 110


L_P — Mathematical Presentation Languages
Denotes mathematical presentational languages (PL), such as presentation MathML or
LaTeX. 106, 107, 109-112, 134, 138


mBM25(t, d) = max (k + 1) IDF(t) ITF(t, d) TF (t, d) 


deD TF(t,d)+k(1-5b+ NO 
vedeo P ( + Mavde) 


Our mathematical BM25 ranking to measure the importance of a given MOI t in a document
d ∈ D which is part of a corpus D. IDF(t) is the inverse-document frequency,
ITF(t, d) the inverse-term frequency of t in d, TF(t, d) the term frequency of t in d,
AVG_DL the average document length (number of terms) in D, AVG_C the average complexity
of terms in D, c(t) the complexity of t, and b, k are parameters. 85


s_DLMF(r_f)
The probability score for a replacement rule r, = m — m. This score is the probability 
that m is rendered as m in the DLMF. For example, the general hypergeometric function 
never omits arguments, such as in $_2F_1(z)$ in the DLMF. Hence, the probability of $_2F_1(z)$
is 0. In contrast, in 19.7% of the cases, the function uses the linear rendered form
$_2F_1(a, b; c; z)$. 114, 137


s_ES(f) = s_ES(MLP, MC)
The normalized Elasticsearch score for a retrieved semantic macro M for the given MC ∈ f.
This score is higher if MC better matches the description of the semantic macro M.
Since ES provides absolute scores, this score is normalized to the best fitting hit, i.e., the
first retrieved result is always scored 1. 113, 114


s_MLP(f) = s_MLP(MLP, MC)
The score of the MLP engine [330] for a given fact f which depends on (1) the distance
between the MOI and its first occurrence in the document D, (2) the distance in the
natural language syntax tree between the MOI and the MC, and (3) whether the MOI and MC
match pre-defined patterns. 112-114, 137


© The Author(s) 2023


t(e, X) = ty (t,(e, X)) 
Our translator function follows a two step strategy of which the first step is a semantifi- 
cation t,(e, X) followed by a rule-based transformation t, (e). 106, 107, 134, 138 


time) = Im, 9°" O Gr, (€) 
A rule-based translation function that performs translations on a set of rules r, € 
REN k =1,...,n from a content language C} to another content language C3. Similar 


to the semantification function, it performs graph transformations g, based on the rules. 
Example implementations are LaCASt or SymPy’s latex2sympy function. 106, 107


t,(e,X) = Gf, 07770 97, (€) 
A fact-based semantification translation function takes an expression e and a context X 
to perform a series of graph transformations g, defined by the facts f to semantically 
enhance subtrees of e. 106-108, 137 


AI — Artificial Intelligence
A broad research field with the focus on machine (artificial) intelligence. 103, 142 


AJIM — Aslib Journal of Information Management 
An international journal with a 5-year IF of 2.653 in library and information science
with focus on information and data management. According to
https://academic-accelerator.com/5-Year-Impact-Factor/Aslib-Journal-of-Information-Management
[accessed 2021-10-01], it is ranked 33rd of 227 journals in the field of library
and information sciences. 9, 15, 163


arXiv: 
Is a pre-print archive for scientific papers in a variety of different fields, such as
mathematics, physics, or computer science. See arxiv.org [accessed 2021-10-01] for more
information. 40, 62-66, 68, 70, 71, 73-75, 78-84, 86, 91, 92, 99, 101, 103, 144, 192


arXMLiv: 
An HTML5 (including MathML) dataset based on the arXiv articles. The HTML5 was
generated via LaTeXML and is available at
https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/ [accessed 2021-10-01] [132]. 65, 74


Axiom: 
Is a free, general-purpose CAS first developed by IBM around 1965 (named Scratchpad at
that time). Since 2001, Axiom has been open source under a modified BSD license and available
on GitHub at https://github.com/daly/axiom [accessed 2021-10-01] [173]. 5, 34, 35


BLEU — Bilingual Evaluation Understudy 
Is an algorithm to measure the quality of translated texts first described by Papineni et
al. [282] in 2001. The algorithm presumes that the closer a translation is to human
translations (the more n-grams they share), the better it is. 14, 99, 100, 134, 146




BM25 — Okapi BM25 
Is a ranking function to calculate the relevance of results in a search engine [310]. The
underlying idea of BM25 is that words that appear regularly in only a few documents are
more important for those documents than words that appear everywhere across the entire
corpus. 12, 73, 83, 85, 113, 145


CAS — Computer Algebra System(s) 
A mathematical software that allows one to work with mathematical expressions, e.g., by
manipulating, computing, or plotting them. The acronym CAS, in this thesis, refers
to a single or multiple systems depending on the context. ix, xi, xii, 1-8, 10, 13-15, 19-22,
24-36, 38, 40-43, 47, 52, 55, 58-60, 93, 95, 97, 103-108, 111, 115-120, 123-129, 131-136, 
138, 139, 141, 143-150, 152, 154-156, 158, 163-165, 168, 171, 174, 175, 180, 193 


CD — Content Dictionary 
Content dictionaries are structured documents that contain the definition of mathematical 
concepts. See the OpenMath specification for more details [53]. 23-26, 31, 57, 58, 143 


CICM — Conference on Intelligent Computer Mathematics 
An annual international conference on mathematical computation and information sys- 
tems (has a CORE rank of C since 2021). 9, 10, 15, 116 


CL — Content Language 
Content languages are languages that encode mainly semantic (content) information, 
such as content MathML, OpenMath, or CAS syntaxes. 43 


CLEF — Conference and Labs of the Evaluation Forum 
An annual international conference for systematic evaluation of information access sys- 
tems. 9 


cMML — Content MathML 
Content MathML encodes the meaning of mathematical notations. For more information 
see the explanations about MathML. 22, 23 


CORE — Computing Research and Education Association of Australasia 
Is an association of university departments that provides assessments of major conferences
in the computing disciplines. The main categories are A* (flagship), A (excellent), B (good
to very good), and C for other ranked conferences that meet minimum standards, see
http://portal.core.edu.au/conf-ranks/ [accessed 2021-10-01]. 8, 9


CoreNLP: 
CoreNLP is a Java library for natural language processing tasks developed by the Stanford
NLP Group that includes tokenizers, POS taggers, lemmatizers, and more [240]. 109, 110,
160, 185, 186 


DBOW-PV — Distributed Bag-of-Words of Paragraph Vectors 
An approach to embed entire paragraphs into single vectors introduced by Le and 
Mikolov [222]. 67-69


DL — Deep Learning
Is a broad family of machine learning methods that uses neural networks for learning 
features. 61 


DLMF — Digital Library of Mathematical Functions 
A digital version [98] of NIST’s Handbook of Mathematical Functions [276]. The DLMF (or 
the book respectively) is a standard reference for OPSF and provides access to numerous
definitions, identities, plots, and more. ix, x, xii, 1, 4, 5, 8, 12, 14, 15, 17, 25, 28, 30-33, 35,
40, 46, 47, 49-51, 56, 58, 62, 63, 65, 66, 93-95, 97, 98, 100, 101, 103-109, 112-119, 121-126, 
129-137, 139-142, 144-156, 163-165, 168, 174-183, 190-192 


DML — Digital Mathematical Library 
A general digital library that specifically focuses on mathematics. 63, 115-118, 123, 128, 
132, 133, 148, 164 


DRMF — Digital Repository of Mathematical Formulae
An outgrowth of the DLMF project [77, 78]. 30, 32 


EMNLP — Empirical Methods in Natural Language Processing 
An annual international conference on natural language processing (has a CORE rank of 
A). 9 


ES — Elasticsearch 
A search engine written in Java that uses the open-source search engine library Apache
Lucene, see https://www.elastic.co/ and https://lucene.apache.org/
[accessed 2021-07-02]. 86, 88, 113, 193


GUI — Graphical User Interface 
A visual interface that allows for interacting with data or software. 48, 49 


HTML — HyperText Markup Language 
The standard markup language for web documents. 23, 74 


ICMS — International Congress on Mathematical Software 
A bi-annual congress that gathers mathematicians, scientists, and programmers who
are interested in the development of mathematical software. 9, 13, 60


JCDL — Joint Conference on Digital Libraries 
An annual major conference in the field of digital libraries (had a CORE rank of A* until 
it was unranked in 2021 because the CORE committee removed the entire digital library 
domain from their ranking scheme). 9, 10, 14, 19, 163, 166


LaCASt — LaTeX to CAS translator
Is the name of the framework we developed in this thesis to translate mathematical
LaTeX to CAS. The first version of LaCASt was part of the author’s Master’s thesis and
supported translations only from semantic LaTeX to Maple [3, 13]. Within this thesis,
we extended LaCASt by supporting general LaTeX [11] expressions and additional CAS [8],
such as Mathematica. The source of LaCASt has been publicly available at
https://github.com/gipplab/LaCASt since February 2022. ix-xii, 7, 8, 10, 14-17, 28-30, 32, 58, 95, 100,
101, 105-107, 109-111, 114-119, 121, 122, 124-134, 139, 141, 144-152, 154-156, 159, 163, 
168, 171, 174, 180, 191 


LaTeX:
Is an extension of the typesetting system TeX used for document preparation. LaTeX
provides additional macros on top of TeX allowing the writer to focus more on the content
of a document rather than on the exact layout. Since this thesis focuses on mathematical
expressions in LaTeX, there is not much difference between TeX and LaTeX. ix, xi, 1-3, 5-8,
10, 13, 19-22, 24, 25, 27-35, 37-42, 45-54, 56-60, 74, 83, 88, 93, 94, 97-100, 102-108, 110, 
112, 113, 116, 118, 121, 129, 132, 135, 138-141, 143-146, 152-154, 156, 157, 166, 174-180, 
188-193 


LaTeXML:
Is a tool developed by B. Miller to convert LaTeX documents to a variety of other formats,
such as XML or HTML. The tool can also be used to transform single mathematical LaTeX
expressions to math specific formats, such as MathML, or image formats, such as SVG.
More information can be found at LaTeXML: A LaTeX to XML/HTML/MathML Converter,
https://dlmf.nist.gov/LaTeXML/ [accessed 2021-10-01]. 11, 32, 33, 38, 46-51, 53,
58, 74, 75, 77, 78, 83, 94, 98, 102, 143, 146, 152, 154, 157 


Maple: 
One of the major general-purpose CAS [36] developed by Maplesoft. If not stated other- 
wise, we refer to the version 2020.2. ix, xii, 1, 2, 4-8, 10, 15, 20, 21, 26, 28, 31, 32, 34, 35, 
38, 43, 52, 58, 103, 104, 107-109, 115-120, 123-125, 127-136, 141, 143-145, 147-149, 152, 
154, 155, 164, 165, 168, 169, 180, 189, 193 


Mathematica: 
One of the major general-purpose CAS [393] developed by Wolfram Research. If not stated 
otherwise, we refer to version 12.1.1. ix, xii, 1-6, 8, 10, 15, 20, 21, 26, 28-31, 35, 41, 42, 52, 
97-105, 107-109, 114, 115, 117, 119, 124, 125, 127-136, 138-141, 143, 145-152, 154, 155, 
164, 169-174, 180, 181, 189, 193 


MathIR — Mathematical Information Retrieval 
Is a sub-field of the Information Retrieval (IR) research area and, as such, focuses on
obtaining information (mostly semantics) or retrieving relevant mathematical expressions. 
Note that MIR is another common acronym for mathematical information retrieval. In 
this thesis, we stick with the less overloaded and more precise abbreviation MathIR. ix, 
xi, 1, 6, 8, 11, 19, 39, 40, 54, 55, 59-63, 65, 71-73, 83, 105, 144, 148, 153, 155


MathML — Mathematical Markup Language
An XML structured standard for representing mathematical notations in web pages and
other digital documents [169]. MathML allows encoding the meaning of mathematical
notations to some degree, which is often referred to as content MathML. In contrast,
presentational MathML refers only to the visual encoding of math formulae. In case a math
formula is encoded in presentational and content MathML at the same time, it is often
called parallel markup MathML. 2, 4, 6-8, 10-12, 19-28, 32-35, 37, 39, 41, 43-47, 49-53,
57, 58, 62, 63, 65, 74-78, 92, 94, 105, 106, 117, 133, 143, 144, 148, 149, 152, 156-158, 166


MathMLben — MathML Benchmark 
We developed MathMLben as a benchmark dataset for measuring the quality of MathML 
markup of mathematical formulae appearing in a textual context. See Section 2.3.2 on 
page 43 for further details. 10, 11, 45, 46, 51, 67, 94, 143, 148, 152, 157 


MATLAB: 
Is one of the major proprietary CAS with a specific focus on numeric computations 
developed by MathWorks. MATLAB is also the name of the underlying programming 
language the CAS MATLAB uses [164, 246]. 1, 5, 10, 35 


Maxima: 
Is an open source general-purpose CAS first released in 1982 (originally developed as 
a branch of the predecessor CAS Macsyma [264]) and is still actively maintained [324]. 
2-4, 28, 29, 35, 157, 158 


mBM25 — Mathematical Okapi BM25 
Our extension of the BM25 score for mathematical expressions. 85, 88-90 


MC — Mathematical Concept 
Is a term referring to the concept that defines a mathematical expression including its 
visual appearance, underlying definition, constraints, domains, and other semantic in- 
formation [9]. In the context of this thesis, we simplify this concept and presume that a 
name (or noun phrase) sufficiently specifies a concept so that the name (or noun phrase) 
is considered a representative MC. 106, 108-113, 137, 185 


MEOM — Mathematically Essential Operator Metadata 
Describes the metadata, i.e., argument(s) and bound variable(s), in sums, products, inte- 
grals, and limit operators. 120-122, 124, 128, 129 


MFS — Mathematical Functions Site 
A dataset of mathematical functions and relations maintained by Wolfram Research. The 
dataset is available at https://functions.wolfram.com/ [accessed 2021-10-01].
98-102 


MKM — Mathematical Knowledge Management 
Is the general study of harvesting, maintaining, or managing mathematical information 
in literature and databases. 61, 62, 65 


ML — Machine Learning 
Is a computer science research field (often described as a subfield of artificial intelligence) 
with the relatively broad goal of making predictions for unseen data based on training 
data. 40, 61, 63, 69-71, 97, 103 


MLP — Mathematical Language Processing 
Mathematical language processing describes the technical process of analyzing mathematical 
texts. A specific MLP task is the mapping of textual descriptions to components 
of mathematical formulae (see Schubotz et al. [279]), such as mathematical identifiers. 61, 
62, 65, 72, 110, 137, 160, 185, 186, 188 


MOI — Mathematical Objects of Interest 
Is a term referring to subexpressions in mathematical formulae with a specific meaning [9]. 
One can consider these parts as elements of general interest. 12, 13, 60, 73, 76, 86, 91-94, 
106, 108-113, 136-138, 140, 144-146, 152-154, 160, 185-188, 191, 192 


NIST — National Institute of Standards and Technology 
A US government research institution. 30, 86, 161 


NLP — Natural Language Processing 
Is a research field with the focus on analyzing and processing natural languages in texts, 
images, videos, or audio formats. In this thesis, we mainly refer to natural language 
processing on texts rather than other multimedia formats. 39, 61, 64, 65, 72, 148, 161 


NN — Neural Network 
A graph network that aims to mathematically mimic biological neural networks. 61 


OCR — Optical Character Recognition 
Is a research field that focuses on identifying text and other symbols in images or videos. 
28, 39, 99 


OMDoc — Open Mathematical Document 
Is a markup format developed by Michael Kohlhase [198] to describe mathematical docu- 
ments. 22, 23, 26, 27, 32, 33, 36 


OpenMath: 
Is a markup language similar to MathML which uses an XML format to encode semantic 
information of mathematical expressions. The standard is maintained by the OpenMath 
Society. See http://openmath.org/ [accessed 2021-10-01] for more information. 6, 7, 
19, 21-27, 34-37, 41, 58, 62, 106, 117, 133 


OPSF — Orthogonal Polynomials and Special Functions 
The set of orthogonal polynomials and special functions. Special functions are functions 
that, due to their general importance in certain fields, have specific names and standard 
notations. Note that there is no formal definition of the term special function. The 
NIST Handbook of Mathematical Functions [276] is a standard resource that covers a 
comprehensive set of functions (and orthogonal polynomials) that are widely accepted 
as special. 1, 3, 31-33, 35, 93, 101, 105, 111, 112, 114, 133, 140, 141, 145, 154-156, 185 


ORES — Objective Revision Evaluation Service 
A system used by Wikipedia to classify edits as potentially damaging changes or changes 
made in good faith [144]. 103-105, 135, 136, 141, 142, 159 


PL — Presentation Language 
Presentation languages are languages that encode mainly visual information, such as 
LaTeX or presentation MathML. 43, 51 


pMML — Presentation MathML 
Presentational MathML refers only to the visual encoding of math formulae. For more 
information see the explanations about MathML. 22, 23, 75-77 


POM — Part-of-Math 
Is a LaTeX parser developed by Abdou Youssef [402] that tags each token in the parse tree 
with additional information similar to Part-of-Speech (POS) taggers in natural languages. 
28, 32, 38, 52, 56, 93, 94, 110, 111, 151, 153, 155 


POS — Part-of-Speech 
Part-of-Speech tagging describes the process of tagging words in text with grammatical 
properties of the word. 45, 109, 160, 161, 185 


Reduce: 
Probably the first CAS, from 1963, by Anthony C. Hearn [151], with a large impact on every 
CAS that followed. Since 2008, Reduce is open-source under the BSD license. 
5, 34, 35, 164 


Scientometrics: 
An international journal with a 5-year IF of 3.702 for quantitative aspects of the science 
of science, communication in science and science policy. According to 
https://academic-accelerator.com/5-Year-Impact-Factor/Scientometrics [accessed 
2021-10-01] it is ranked 18th of 227 journals in the field of library and information sciences. 
9, 19, 60 


SCSCP — Symbolic Computation Software Composability Protocol 
Is a protocol to communicate mathematical formulae between mathematical software, 
specifically CAS. It was developed as part of the SCIEnce project funded with 3 Million 
Euro by the European Union. More information can be found in the two publications 
about the project [119, 361]. 24, 26, 35, 36, 58 


semantic LaTeX: 
Refers to mathematical expressions that use semantic macros developed by B. Miller 
for the DLMF. Each of these LaTeX macros is tied to a specific definition in the DLMF. 
Hence, a semantic LaTeX macro represents a unique unambiguous mathematical function 
as defined in the DLMF. An alternative name for semantic LaTeX is content LaTeX. ix, xi, 
2, 7, 8, 10, 12, 15, 19, 22, 28, 30-33, 35, 38, 58, 93-95, 97-100, 115, 116, 133, 138, 143-146, 
149, 152-155, 174 


Semantification: 
Refers to a process that semantically enhances mathematical expressions. Other authors 
may also refer to this as semantic enrichment [71, 270, 402]. ix, xi, 7-11, 13, 24, 54, 57-59, 
94, 95, 97, 103, 104, 106, 107, 115, 138, 144, 145, 147, 152-157, 160, 161, 193 


SIGIR — Special Interest Group on Information Retrieval 
A premier annual international conference on research and development in information 
retrieval (has a CORE rank of A*). 9, 11, 19 


SnuggleTeX: 
Is an open source Java program for converting LaTeX to XML, mainly MathML. SnuggleTeX 
is one of the rare converters that offer a semantic enrichment process to content MathML 
and the only LaTeX to CAS converter (supports Maxima) that is not part of a CAS itself [251]. 
SnuggleTeX is no longer developed; the most recent version, 1.2.2, is from 2010. See also 
https://www2.ph.ed.ac.uk/snuggletex [accessed 2021-10-01]. 2-4, 28, 29, 157, 158 


STEM — Science, Technology, Engineering, and Mathematics 
A group of academic disciplines. ix, xi, 2, 20, 27 


sTeX — Semantic TeX 
Semantic extension of TeX developed by Michael Kohlhase [200]. 19, 22, 30, 32, 33 


SVG — Scalable Vector Graphics 
An XML vector image format. 38, 49, 51, 52 


SymPy: 
An open-source CAS [252] written in Python. 2, 4, 5, 10, 15, 28-30, 34, 35, 146, 149, 154, 
164, 174 


t-SNE — t-distributed Stochastic Neighbor Embedding 
Is a statistical method to visualize high-dimensional data in more convenient and 
easy-to-analyze one-, two-, or three-dimensional plots. t-SNE uses a nonlinear dimensionality 
reduction method that tries to preserve structural groups of data. The method was first 
introduced by Hinton and Roweis [154]. 69, 70 


TACAS — Tools and Alg. for the Construction and Analysis of Systems 
TACAS is a forum for researchers, developers and users interested in rigorously based 
tools and algorithms for the construction and analysis of systems (has a CORE rank of 
A). 9, 15, 116, 163, 168, 180 


TF-IDF — Term Frequency-Inverse Document Frequency 
Is a statistical measure intended to reflect the importance of tokens (e.g., words) to a document 
in a larger corpus. The underlying assumption behind the measure is that frequent 
tokens across an entire corpus are less important compared to tokens that appear frequently 
in single documents but rarely somewhere else. The BM25 ranking function 
is based on the principle of TF-IDF scores. 79, 83, 85, 89, 90 
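
The weighting described above can be illustrated with a minimal sketch. It assumes one common variant (raw term frequency multiplied by log(N / df)), which is only one of several TF-IDF weighting schemes and not necessarily the one used in this thesis. The function name tf_idf and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {token: score} dict per tokenized document.

    Assumed variant: score = tf * log(N / df), where tf is the raw count
    of the token in the document, N the number of documents, and df the
    number of documents that contain the token.
    """
    n_docs = len(corpus)
    # Document frequency: count each token at most once per document.
    df = Counter(token for doc in corpus for token in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores


# Hypothetical toy corpus of three tokenized documents.
corpus = [
    ["the", "jacobi", "polynomial"],
    ["the", "gamma", "function"],
    ["the", "jacobi", "jacobi", "function"],
]
scores = tf_idf(corpus)
```

In this toy corpus, a token that occurs in every document (here "the") receives a score of zero, while a token that is frequent in one document but rare elsewhere receives a higher score.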


TPAMI — Transactions on Pattern Analysis and Machine Intelligence 
An IEEE published top monthly journal with a 5-year IF of 25.816 and a focus on 
pattern analysis and recognition and related fields. According to 
https://academic-accelerator.com/5-Year-Impact-Factor/jp/IEEE-Transactions-on-Pattern-Analysis-and-Machine-Intelligence 
[accessed 2021-10-01] it is the top 
journal in three categories and second in two additional categories. 9, 13, 16, 97, 116, 163 


VMEXT — Visual Tool for Mathematical Expression Trees 
A visualization tool for mathematical expression trees developed by Schubotz et al. [331]. 
37, 46, 49, 50 


W3C — World Wide Web Consortium 
Is an international organization for standards for the world wide web. See www.w3.org 
[accessed 2021-06-09]. 23, 24 


WED — Wolfram Engine for Developers 
Is a free interface for the Wolfram engine (the engine behind Mathematica). Since 2019, 
this interface allows developers to interact with and use most of Mathematica’s core features 
without purchasing a full license. More information is available at 
https://www.wolfram.com/engine/ [accessed 2021-09-07]. 117, 127, 131 


WSDM — Web Search and Data Mining 
A premier conference on web-inspired research involving search and data mining (has a 
CORE rank of A*). 9, 97 


WWW — The Web Conference 
An annual major conference with the focus on the world wide web (has a CORE rank of 
A*). 9, 12, 60 


XML — Extensible Markup Language 
A markup language mainly used for the representation of many different data structures. 
20, 23-25, 27, 32, 33, 37, 43, 47, 51, 52, 74, 76, 77, 157 


XSLT — Extensible Stylesheet Language (XSL) Transformations 
A language to transform XML documents. 23, 24, 26 


zbMATH — Zentralblatt MATH 
Is an international reviewing service for abstracts and articles in mathematics. zbMATH 
provides access to the abstracts and reviews of research articles mostly in the field of pure 
and applied mathematics, see also https://zbmath.org/ [accessed 2021-10-01]. 13, 
73-75, 78-80, 83, 84, 86, 88-90, 92, 144, 148 